THE NLP PROJECT

Text Encoding for Beginners

In this second part of the series called ‘The NLP Project’, you will understand how text is stored on machines.

Ishan Singh
Artificial Intelligence in Plain English
3 min read · Apr 27, 2020

Photo by Andrew Seaman on Unsplash

Here is the link to Part-1 of ‘The NLP Project’.

When you work with text, you will not always be working with English. The Internet is used in many countries and in many languages, so a lot of text is written in languages other than English. To work with such text, you need to understand how all the other characters are stored.

Computers can directly manipulate numbers and store them in registers (small, fast storage locations inside the processor) and in memory. They cannot store characters as such, so letters and special characters must first be converted to a numeric value before being stored.

This is why the concept of encoding came into being: every character is mapped to a number according to an agreed code. Computer manufacturers also needed to standardize on a common encoding, so that text written on one machine could be read correctly on another.

One of the first widely adopted encoding standards was ASCII (American Standard Code for Information Interchange), first published in 1963. For example, the ASCII code for the letter ‘A’ is 65 and the code for the digit zero is 48. Since then, the standard has been revised and extended to accommodate characters that were not part of the initial encoding.
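To check these code values yourself, Python's built-in ord() and chr() functions convert between a character and its numeric code. This is only a small sketch, and the variable names are chosen for illustration:

```python
# ord() returns the numeric code of a character; chr() does the reverse.
code_for_a = ord('A')
code_for_zero = ord('0')
print(code_for_a)     # 65
print(code_for_zero)  # 48
print(chr(65))        # A

# Encoding a pure-ASCII string yields one byte per character.
ascii_bytes = 'A0'.encode('ascii')
print(list(ascii_bytes))  # [65, 48]
```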

When ASCII was created, the only letters on the keyboard were those of the English alphabet. Over time, keyboards for other languages appeared, bringing new characters with them. ASCII defines only 128 characters and does not support most languages. A newer standard, Unicode, was created to fill this gap. It aims to cover all the writing systems of the world, modern and historic.
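The limitation of ASCII is easy to reproduce. The sketch below, using a rupee sign as an example of a character outside ASCII, shows Python refusing to encode it as ASCII, while Unicode assigns it a code point of its own (the variable names are illustrative):

```python
text = "₹50"  # the rupee sign is not part of the ASCII character set

try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError as err:
    ascii_ok = False
    print('ASCII cannot encode this text:', err)

# Unicode gives the same character its own code point.
rupee_code_point = ord('₹')
print(hex(rupee_code_point))  # 0x20b9
```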

Before you begin any text processing, you need to know which encoding the text uses and, if necessary, convert it to a different encoding.
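One rough way to do this in Python is to try decoding the raw bytes with a few candidate encodings and then re-encode the text in the format you want. The helper decode_with_fallback below is hypothetical, not a standard function, and the candidate list is an assumption. Real encoding detection is heuristic and can guess wrong.

```python
def decode_with_fallback(raw_bytes, candidates=('utf-8', 'latin-1')):
    """Try each candidate encoding in order; return (text, encoding_used).

    Note: latin-1 accepts any byte sequence, so as the last candidate it
    never fails, but it may produce the wrong characters.
    """
    for encoding in candidates:
        try:
            return raw_bytes.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the candidate encodings matched')


# Bytes that were written as Latin-1, not UTF-8.
raw = 'café'.encode('latin-1')
text, used = decode_with_fallback(raw)
print(text, used)

# Convert to UTF-8 for the rest of the pipeline.
utf8_bytes = text.encode('utf-8')
print(utf8_bytes)
```

Trying UTF-8 first is a deliberate choice: invalid UTF-8 usually fails loudly, whereas a permissive encoding such as Latin-1 decodes anything without complaint.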

ENCODING STANDARDS

There are two widely used encoding standards:

  1. American Standard Code for Information Interchange (ASCII)
  2. Unicode

Unicode text is stored using an encoding form such as UTF-8 or UTF-16. UTF-8 has a clear advantage when the character belongs to the English alphabet or the ASCII character set: its UTF-8 code is identical to its ASCII code. In addition, UTF-8 uses only 8 bits to store such a character, whereas UTF-16 (BE) uses 16 bits, which wastes memory.
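The size difference for an ASCII character can be checked directly. In this small sketch, 'utf-16-be' is Python's codec name for big-endian UTF-16 without a byte-order mark:

```python
letter = 'A'
utf8_bytes = letter.encode('utf-8')
utf16_bytes = letter.encode('utf-16-be')  # big-endian, no byte-order mark

print(len(utf8_bytes) * 8)   # 8 bits
print(len(utf16_bytes) * 8)  # 16 bits

# For ASCII characters, the UTF-8 bytes are the same as the ASCII bytes.
same_as_ascii = utf8_bytes == letter.encode('ascii')
print(same_as_ascii)  # True
```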

But when a symbol does not appear in the ASCII character set, UTF-8 needs more than one byte. For a symbol such as ₹, UTF-8 uses 24 bits, while UTF-16 (BE) uses only 16 bits. So the storage advantage of UTF-8 is reversed and becomes a drawback here. The other advantage of UTF-8, its compatibility with ASCII, does not help either, because the symbol has no ASCII code at all.
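The exact size depends on the character. As a quick illustration, the sketch below measures a few sample characters (the sample list is my own choice) in both encodings:

```python
samples = ['A', 'é', '₹', '😀']

# Map each character to (bits in UTF-8, bits in UTF-16 BE).
sizes = {}
for ch in samples:
    utf8_bits = len(ch.encode('utf-8')) * 8
    utf16_bits = len(ch.encode('utf-16-be')) * 8
    sizes[ch] = (utf8_bits, utf16_bits)
    print(ch, 'UTF-8:', utf8_bits, 'bits,', 'UTF-16 (BE):', utf16_bits, 'bits')
```

For these samples, UTF-8 is smaller for 'A', the two are equal for 'é' and '😀', and UTF-16 is smaller only for '₹'.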

In Python 3, every string is a Unicode string, and UTF-8 is the default encoding used by str.encode() and bytes.decode(). You can also check a UTF-8 encoder-decoder tool online to see how a string is stored. Notice that such a tool gives you the hexadecimal code of a given string.
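The same hexadecimal view is available without leaving Python. A minimal sketch, using the bytes.hex() method on an example string:

```python
amount = "₹50"
encoded = amount.encode('utf-8')

# bytes.hex() returns the bytes as a hexadecimal string.
hex_string = encoded.hex()
print(hex_string)  # e282b93530

# One byte per group is easier to read: 3 bytes for the rupee sign, 1 each for the digits.
spaced = ' '.join(f'{byte:02x}' for byte in encoded)
print(spaced)  # e2 82 b9 35 30
```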

You can also try out this code snippet in your Python IDE:

# Create a string containing a non-ASCII character (the rupee sign).
# In Python 3 every str is Unicode, so the u"" prefix is not needed.
amount = "₹50"
print('Default string:', amount)
print('Type of string:', type(amount), '\n')

# Encode to UTF-8; the result is a bytes object.
amount_encoded = amount.encode('utf-8')
print('Encoded to UTF-8:', amount_encoded)
print('Type of string:', type(amount_encoded), '\n')

# Decode from UTF-8 bytes back to a str.
amount_decoded = amount_encoded.decode('utf-8')
print('Decoded from UTF-8:', amount_decoded)
print('Type of string:', type(amount_decoded), '\n')

That’s it for today, folks. I am happy to hear any questions or feedback.
