Unicode Note

Unicode defines a total of 1,114,112 code points (a code point is a non-negative integer that identifies an abstract character) in the range 0x0 to 0x10FFFF. Not every code point is assigned to a character: the latest version at the time of writing, Unicode 6.0, defines about 109,000 of them.
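As an illustration (a Python 3 session; any language with Unicode strings would do), ord() maps a character to its code point and chr() maps a code point back to a character:

    >>> ord('A')
    65
    >>> hex(ord('€'))
    '0x20ac'
    >>> chr(0x10FFFF)        # the highest valid code point
    '\U0010ffff'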

Unicode code points are divided into 17 planes of 65,536 code points each (17 × 65,536 = 1,114,112). The first plane, with code points from 0x0000 to 0xFFFF, is called the Basic Multilingual Plane, or BMP.
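Since every plane holds 0x10000 code points, the plane of a code point is just the quotient of an integer division. A minimal Python sketch (the helper name plane is ours, not a standard function):

    def plane(code_point):
        # Each plane spans 0x10000 (65,536) code points.
        return code_point // 0x10000

    assert plane(ord('A')) == 0       # 'A' is in the BMP (plane 0)
    assert plane(0x1F600) == 1        # U+1F600 is in plane 1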

To store a Unicode code point, we have to encode its value in some binary form. There are many ways to encode Unicode code points, but since there are far more than 256 of them, a single byte (8 bits) cannot represent most code points:

UCS-2 encodes only the code points in the BMP, with a fixed-length scheme: each code point is represented by two bytes (16 bits), which raises the question of byte order (endianness). UCS-2 is defined to be big-endian only, but most software defaults to little-endian and handles endianness just as UTF-16 does, with a BOM.
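Python has no UCS-2 codec, but for BMP code points UTF-16 produces byte-identical output, so the utf-16-be and utf-16-le codecs can stand in to show the two byte orders (Python 3):

    >>> 'é'.encode('utf-16-be')   # U+00E9, big-endian
    b'\x00\xe9'
    >>> 'é'.encode('utf-16-le')   # same code point, bytes swapped
    b'\xe9\x00'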

UCS-4 uses four bytes (32 bits) to encode a code point, which is sufficient to represent all possible Unicode code points. UCS-4 is equivalent to UTF-32.
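The fixed four-byte width is easy to see through Python's utf-32-be codec:

    >>> 'é'.encode('utf-32-be')               # U+00E9, four bytes
    b'\x00\x00\x00\xe9'
    >>> '\U0001F600'.encode('utf-32-be')      # non-BMP, same width
    b'\x00\x01\xf6\x00'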

UTF-16 is basically UCS-2 plus the ability to use two 16-bit code units (a surrogate pair) to encode code points in planes other than the BMP (i.e., code points above 0xFFFF). Unicode reserves code points 0xD800 to 0xDFFF for surrogates, which means only 0x0000 to 0xD7FF and 0xE000 to 0xFFFF have meanings in UCS-2.
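The surrogate-pair mapping is simple arithmetic. Here is a Python sketch; the helper name to_surrogate_pair is ours, but its result matches what the built-in codec produces:

    def to_surrogate_pair(code_point):
        # Only code points above the BMP need a surrogate pair.
        offset = code_point - 0x10000       # a 20-bit value
        high = 0xD800 + (offset >> 10)      # high surrogate: top 10 bits
        low = 0xDC00 + (offset & 0x3FF)     # low surrogate: bottom 10 bits
        return high, low

    print([hex(u) for u in to_surrogate_pair(0x1F600)])
    # ['0xd83d', '0xde00']
    print('\U0001F600'.encode('utf-16-be').hex())
    # 'd83dde00'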

UTF-16 encoded files usually come with a byte-order mark (BOM) to indicate their endianness. If the BOM is omitted, the text is usually assumed to be big-endian (UTF-16BE), but because Windows uses little-endian (UTF-16LE) by default, some applications assume little-endian instead.
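The BOM is simply the encoding of U+FEFF, so its byte order reveals the file's byte order. Python's codecs module exposes the sequences, and the plain utf-16 codec writes a BOM in the machine's native order (the last line shows a little-endian machine):

    import codecs
    print(codecs.BOM_UTF16_BE)    # b'\xfe\xff'
    print(codecs.BOM_UTF16_LE)    # b'\xff\xfe'
    print('A'.encode('utf-16'))   # b'\xff\xfeA\x00' on little-endian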

Windows NT (before Windows 2000) internally used UCS-2 to store Unicode characters, because UTF-16 was not ready at that time. Windows 2000, XP, and later versions of Windows use UTF-16 as their internal representation.

OS X's Cocoa and Core Foundation, Qt, and the .NET Framework also use UTF-16.

Java used UCS-2 before J2SE 5.0. Since then it uses UTF-16, with the caveat that code points above the BMP must be entered as their two surrogate values individually.

Python 2.x internally uses either UCS-2 or UCS-4 to represent Unicode characters. On Windows and OS X, Python defaults to UCS-2; on Linux and other Unix variants, it defaults to UCS-4. There is a compile-time option (--with-wide-unicode) to change the default. Note that a UCS-2 Python and a UCS-4 Python are binary incompatible.
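The build type is visible at runtime via sys.maxunicode, and a narrow build has the same practical consequence as Java's surrogate-pair caveat (a Python 2 sketch):

    import sys
    print(hex(sys.maxunicode))  # 0xffff on a UCS-2 (narrow) build,
                                # 0x10ffff on a UCS-4 (wide) build

    s = u'\U0001F600'           # a non-BMP character
    print(len(s))               # 2 on a narrow build (surrogate pair),
                                # 1 on a wide build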

Python 3.3+ uses a new Unicode string representation, and the compile-time option to choose UCS-2/UCS-4 is gone. Python picks the most space-efficient representation based on the highest code point in the string: 1 byte per code point for pure ASCII (and Latin-1) strings, 2 bytes per code point for BMP strings, and 4 bytes per code point for strings containing non-BMP code points. Details can be found in PEP 393.
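The effect shows up directly in memory usage; sys.getsizeof makes it visible (exact totals include per-object overhead and vary across platforms and versions):

    import sys
    ascii_s = 'a' * 1000            # 1 byte per code point
    bmp_s = u'\u20ac' * 1000        # 2 bytes per code point
    astral_s = '\U0001F600' * 1000  # 4 bytes per code point
    print(sys.getsizeof(ascii_s))   # roughly 1000 bytes plus overhead
    print(sys.getsizeof(bmp_s))     # roughly 2000 bytes plus overhead
    print(sys.getsizeof(astral_s))  # roughly 4000 bytes plus overhead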