Unicode in Python

History of Character Codes

ASCII

In 1968, ASCII was standardized. It defined numeric codes for various characters, with values running from 0 to 127. The set covers the digits 0-9, the letters a-z and A-Z, some basic punctuation symbols, some control codes that originated with Teletype machines, and a blank space, all of which fit into 7-bit integers. On 8-bit PCs, a byte could hold values from 0 to 255, but ASCII only went up to 127, so ASCII was extended to use the remaining values (128 to 255) for more characters (e.g. accented characters).

But 256 values still aren't enough to hold the world's characters all at once (e.g. both the accented characters used in Western Europe and the Cyrillic alphabet used for Russian).
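A short Python sketch of the problem: the same byte value stands for different characters depending on which 8-bit extension is in use. The two codec names below (latin-1 and cp1251) are picked here only for illustration and are not part of the text above.

```python
# One byte value, read through two different 8-bit code pages
# (both codecs are illustrative choices).
raw = bytes([0xE9])

western = raw.decode("latin-1")   # a Western European code page
cyrillic = raw.decode("cp1251")   # a Cyrillic code page

print(repr(western), repr(cyrillic))

# Plain ASCII stops at 127, so it cannot encode an accented letter at all.
try:
    "\u00e9".encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(ascii_ok)
```

The byte is identical in both cases; only the assumed code page changes the character it is read as.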

Unicode
Unicode started out using 16-bit characters instead of 8-bit characters to solve the problem above. 16 bits gives 65,536 distinct values. Unfortunately, it turned out that even 16 bits aren't enough to hold every character, so the modern Unicode specification uses a wider range of codes, 0 to 1,114,111 (0x10ffff in base 16).
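As a rough check of that range in Python 3, the sketch below looks at the largest code point the interpreter accepts and at one character that does not fit in 16 bits. The emoji is just a convenient example; any character above U+ffff would do.

```python
import sys

# The largest code point Python's str type accepts.
max_code_point = sys.maxunicode

# A character beyond the original 16-bit range (U+1F600).
emoji = "\U0001F600"
fits_in_16_bits = ord(emoji) <= 0xFFFF

# chr() rejects values past the end of the Unicode range.
try:
    chr(0x110000)
    out_of_range_rejected = False
except ValueError:
    out_of_range_rejected = True

print(hex(max_code_point), hex(ord(emoji)), fits_in_16_bits, out_of_range_rejected)
```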

Definitions

The Unicode standard describes how characters are represented by code points [1]. A Unicode string is a sequence of code points, which are numbers from 0 to 0x10ffff. In memory, this sequence needs to be represented as a sequence of bytes (meaning values from 0 to 255). The rules for translating a Unicode string into a sequence of bytes are called an encoding.
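To make "a sequence of code points" concrete, here is a small Python 3 sketch that lists the code points of a string and prints them in the U+ notation. The sample string is an arbitrary choice.

```python
text = "caf\u00e9"  # "café", written with an escape to avoid source-encoding issues

# ord() returns the code point of a single character.
code_points = [ord(ch) for ch in text]

# Format each code point in the usual U+xxxx notation.
labels = [f"U+{cp:04x}" for cp in code_points]

print(code_points)  # [99, 97, 102, 233]
print(labels)       # ['U+0063', 'U+0061', 'U+0066', 'U+00e9']
```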

decoding/encoding
The process of converting a Unicode string (a sequence of code points) to a sequence of bytes is encoding, while the reverse process, converting bytes back to a Unicode string, is decoding.
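In Python 3 the two directions map onto str.encode() and bytes.decode(). A minimal round trip, assuming UTF-8 as the encoding:

```python
text = "caf\u00e9"  # four code points

data = text.encode("utf-8")      # str -> bytes (encoding)
restored = data.decode("utf-8")  # bytes -> str (decoding)

print(data)              # b'caf\xc3\xa9'
print(len(text), len(data))
print(restored == text)
```

Note that the string has four code points but the encoded form has five bytes, since the accented letter takes two bytes in UTF-8.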

UTF-8 is probably the most commonly supported encoding. Encodings don't have to handle every possible Unicode character, and most don't. UTF-8 uses the following rules:

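# If the code point is < 128, it's represented by the corresponding byte value.

# If the code point is between 128 and 0x7ff, it's turned into two byte values between 128 and 255.

# Code points > 0x7ff are turned into three- or four-byte sequences, where each byte of the sequence is between 128 and 255.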
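The one- and two-byte rules above can be sketched by hand and compared against Python's built-in encoder. The function utf8_encode_small is a hypothetical helper written for this illustration, not a standard library API, and it deliberately stops at 0x7ff.

```python
def utf8_encode_small(code_point: int) -> bytes:
    """Encode a code point up to 0x7ff by hand (hypothetical helper)."""
    if code_point < 0x80:
        # Below 128: the byte value is the code point itself.
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 128 to 0x7ff: the top 5 bits go into a lead byte (110xxxxx),
        # the low 6 bits into a trailing byte (10xxxxxx).
        lead = 0xC0 | (code_point >> 6)
        trail = 0x80 | (code_point & 0x3F)
        return bytes([lead, trail])
    raise ValueError("this sketch only covers code points up to 0x7ff")


# Compare the hand-written result with str.encode() for a few values.
for cp in (0x41, 0xE9, 0x7FF):
    assert utf8_encode_small(cp) == chr(cp).encode("utf-8")

print(utf8_encode_small(0xE9))  # b'\xc3\xa9'
```

Both bytes of the two-byte form have their high bit set, which is why they always fall between 128 and 255.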

[1] A code point is an integer value, usually written in base 16, that generally represents a single character. The notation U+12ca means the character with value 0x12ca (4810 decimal).
