Basics of Unicode and Encodings

Unicode

What is Unicode?
Unicode is a standard that assigns a unique number to every character of every writing system it covers.
Each character is assigned its number by the Unicode Consortium. This number is called a code point and is written in the form U+0041. At the time of writing, the latest version of Unicode contained a repertoire of more than 120,000 characters covering 129 modern and historic scripts, as well as multiple symbol sets.

Examples of some Unicode code points:

Character   Code point
A           U+0041
D           U+0044
a           U+0061
अ           U+0905
क           U+0915
Ω           U+03A9

The U+ means “Unicode” and the numbers are hexadecimal.
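As a minimal sketch, the mapping between characters and code points can be inspected in Python with the built-in ord() and chr(). The helper name code_point_label is hypothetical and exists only for this illustration.

```python
def code_point_label(ch: str) -> str:
    """Return the U+XXXX label for a single character."""
    # ord() gives the code point as an integer; format it as 4+ hex digits.
    return f"U+{ord(ch):04X}"


for ch in "ADaअकΩ":
    print(ch, code_point_label(ch))

# chr() goes the other way: from the number back to the character.
omega = chr(0x03A9)
```

Note that nothing here says how the character is stored; that is the job of an encoding.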

 

Encodings

Encodings define how code points are stored in memory or on disk. A code point is only an abstract number; how that number is represented as bytes is defined by the encoding. Unicode can be implemented by several different character encodings. The most commonly used are UTF-8 and UTF-16.

If you see ????? or garbled characters in your content, the content is probably being interpreted using the wrong encoding.
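A small Python sketch of how such damage can arise. The sample string and variable names are illustrative assumptions, not taken from any particular system.

```python
text = "अ Ω"

# The bytes that would be written to disk or sent over the network.
utf8_bytes = text.encode("utf-8")

# Decoding with the wrong encoding does not raise an error here;
# it silently produces different characters.
wrong = utf8_bytes.decode("iso-8859-1")

# Converting to a charset that lacks these characters, with
# errors="replace", turns each of them into a question mark.
lossy = text.encode("ascii", errors="replace")

# Decoding with the encoding that was actually used restores the text.
right = utf8_bytes.decode("utf-8")

print(repr(wrong), lossy, right)
```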

If you don't know whether a particular string is encoded using UTF-8, UTF-16 or ISO 8859-1, you cannot display it correctly.
An email message is expected to declare its encoding in a header of the form:

 Content-Type: text/plain; charset="UTF-8"
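One possible way to read the charset out of such a header in Python is the standard library's email.message.Message class. This is only a sketch; the helper name charset_from_content_type and the "us-ascii" fallback are assumptions made for the example.

```python
from email.message import Message


def charset_from_content_type(header_value: str, default: str = "us-ascii") -> str:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset in lower case,
    # or the fallback object when no charset parameter is present.
    return msg.get_content_charset(failobj=default)


charset = charset_from_content_type('text/plain; charset="UTF-8"')
body = b"\xce\xa9".decode(charset)  # the two UTF-8 bytes of the character Ω
print(charset, body)
```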

An HTML file is expected to declare its Content-Type in a meta tag. When a web browser sees this tag, it stops parsing the page and starts over, reinterpreting the whole page using the encoding you specified:

<html>
 <head>
 <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

The meta tag has to be the very first thing in the <head> section.
If the browser doesn't find any Content-Type, either in the HTTP headers or in the meta tag, it tries to guess the encoding.

 

UTF-8

UTF-8 is a character encoding capable of encoding all code points in Unicode. The encoding is variable-length and uses 8-bit code units. In UTF-8, every code point from 0 to 127 is stored in a single byte. Code points 128 and above are stored using 2, 3 or 4 bytes. As a result, ASCII text looks exactly the same in UTF-8.

Examples of some Unicode code points and their UTF-8 representations:

Character   Code point   UTF-8 bytes (hex)   Number of bytes
A           U+0041       41                  1
D           U+0044       44                  1
a           U+0061       61                  1
अ           U+0905       E0 A4 85            3
क           U+0915       E0 A4 95            3
Ω           U+03A9       CE A9               2

In UTF-8,
code points (U+0000 to U+007F) take 1 byte
code points (U+0080 to U+07FF) take 2 bytes
code points (U+0800 to U+FFFF) take 3 bytes
code points (U+10000 to U+10FFFF) take 4 bytes

The original UTF-8 design also defined 5-byte and 6-byte sequences for values up to U+7FFFFFFF. Unicode ends at U+10FFFF, however, and RFC 3629 restricts UTF-8 to at most 4 bytes, so those longer forms are no longer valid.
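The byte-length rule can be sketched as a hand-written encoder in Python. This is an illustration rather than a production implementation: the function name utf8_encode is hypothetical, and the sketch assumes the current Unicode range ending at U+10FFFF, so it only produces sequences of 1 to 4 bytes.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point as UTF-8 (1 to 4 bytes)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp <= 0x7F:
        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


for cp in (0x41, 0x3A9, 0x905, 0x1F600):
    encoded = utf8_encode(cp)
    # Cross-check against Python's built-in codec.
    assert encoded == chr(cp).encode("utf-8")
    print(f"U+{cp:04X}", encoded.hex(" ").upper(), len(encoded))
```

The leading bits of the first byte announce how many bytes the sequence has, and every following byte starts with the bits 10.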

 

UTF-16

UTF-16 (16-bit Unicode Transformation Format) is a character encoding capable of encoding all code points in Unicode. The encoding is variable-length and the code points are encoded with one or two 16-bit code units.

Examples of some Unicode code points and their UTF-16 representations (big-endian):

Character   Code point   UTF-16 bytes (hex)   Number of bytes
A           U+0041       00 41                2
D           U+0044       00 44                2
a           U+0061       00 61                2
अ           U+0905       09 05                2
क           U+0915       09 15                2
Ω           U+03A9       03 A9                2

U+0000 to U+D7FF and U+E000 to U+FFFF
Code points in these ranges are encoded as a single 16-bit code unit that is numerically equal to the code point. The gap between them, U+D800 to U+DFFF, is reserved for surrogates.

U+10000 to U+10FFFF
Code points in this range are encoded as two 16-bit code units (a surrogate pair), taking 4 bytes.
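How the two 16-bit code units are derived for code points above U+FFFF can be sketched in Python as follows. The function name utf16_code_units is hypothetical and chosen for this example only.

```python
def utf16_code_units(cp: int) -> list:
    """Return the UTF-16 code units (as integers) for one code point."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp < 0x10000:
        # One code unit, numerically equal to the code point.
        return [cp]
    # Subtract 0x10000 to get a 20-bit value, then split it into two
    # 10-bit halves placed in the high and low surrogate ranges.
    v = cp - 0x10000
    return [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]


def to_big_endian_bytes(units: list) -> bytes:
    """Serialize 16-bit code units with the most significant byte first."""
    return b"".join(u.to_bytes(2, "big") for u in units)


for cp in (0x41, 0x905, 0x1F600):
    units = utf16_code_units(cp)
    # Cross-check against Python's built-in codec.
    assert to_big_endian_bytes(units) == chr(cp).encode("utf-16-be")
    print(f"U+{cp:04X}", [f"{u:04X}" for u in units])
```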

 

UTF-32

UTF-32 (Unicode Transformation Format 32 bits) is an encoding that uses exactly 32 bits per Unicode code point. This makes UTF-32 a fixed-length encoding, in contrast to UTF-8 and UTF-16, which are variable-length encodings. The UTF-32 form of a code point is a direct representation of that code point's numerical value.

Examples of some Unicode code points and their UTF-32 representations (big-endian):

Character   Code point   UTF-32 bytes (hex)   Number of bytes
A           U+0041       00 00 00 41          4
D           U+0044       00 00 00 44          4
a           U+0061       00 00 00 61          4
अ           U+0905       00 00 09 05          4
क           U+0915       00 00 09 15          4
Ω           U+03A9       00 00 03 A9          4
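Because the UTF-32 form is just the code point written as a 4-byte integer, a sketch in Python is very short. The helper names utf32_be and nth_code_point are assumptions made for this illustration.

```python
def utf32_be(cp: int) -> bytes:
    """Encode one code point as 4 big-endian bytes."""
    return cp.to_bytes(4, "big")


def nth_code_point(data: bytes, n: int) -> int:
    """Read the n-th code point from big-endian UTF-32 data.

    Fixed width means character n always starts at byte offset 4 * n.
    """
    return int.from_bytes(data[4 * n: 4 * n + 4], "big")


encoded = "Aअ Ω".encode("utf-32-be")
print(encoded.hex(" ").upper())
print(f"U+{nth_code_point(encoded, 1):04X}")
```

The price of this simplicity is size: every character takes 4 bytes, including plain ASCII.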

 

Endianness

Endianness refers to the order in which the bytes of a multi-byte word are stored in computer memory. Words may be represented in big-endian or little-endian format. In big-endian format, the most significant byte of a word is stored at the lowest memory address and the least significant byte at the highest memory address.

In little-endian format, the least significant byte is stored at the lowest memory address and the most significant byte at the highest memory address.

For example the bytes representing the character अ in UTF-16 big-endian encoding will be stored as:
09 05

The bytes representing the character अ in UTF-16 little-endian encoding will be stored as:
05 09

In both cases, the first byte shown is at the lower memory address and the second byte is at the higher memory address.
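The same two byte orders can be reproduced in Python, shown here as a brief sketch. The variable names are illustrative; sys.byteorder reports the native order of the machine running the code.

```python
import sys

cp = 0x0905  # code point of अ

# Same 16-bit value, two different byte orders.
big = cp.to_bytes(2, "big")
little = cp.to_bytes(2, "little")

print("big-endian:   ", big.hex(" ").upper())
print("little-endian:", little.hex(" ").upper())
print("native order of this machine:", sys.byteorder)

# The explicit UTF-16 codecs produce exactly these bytes.
assert big == "अ".encode("utf-16-be")
assert little == "अ".encode("utf-16-le")
```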

The Intel x86 and x86-64 series of processors use the little-endian format, and for this reason, the little-endian format is also known in the industry as the “Intel convention”.

 

Byte Order Mark (BOM)

Unicode code points can be encoded as 8-bit, 16-bit, or 32-bit integers. For the 16-bit and 32-bit representations, a computer receiving text from arbitrary sources needs to know the byte order to read the data properly.
In UTF-16, a BOM (U+FEFF) may be placed as the first character of a file or character stream to indicate the endianness of all the 16-bit code units of the file or stream.

The exact bytes that make up the BOM are whatever the Unicode character U+FEFF is converted into by the transformation format in use, as shown in the table below.

Bytes         Encoding form
00 00 FE FF   UTF-32, big-endian
FF FE 00 00   UTF-32, little-endian
FE FF         UTF-16, big-endian
FF FE         UTF-16, little-endian
EF BB BF      UTF-8

 

UTF-8 can contain a BOM, but it says nothing about the endianness of the byte stream. UTF-8 has 1-byte (8-bit) code units, so its byte order is always the same. In UTF-8 the BOM only serves as a signature that marks the stream as UTF-8.

For UTF-16, the byte order is determined by the byte order mark if one is present at the beginning of the data stream. Otherwise, unless a higher-level protocol says differently, it is big-endian.
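A BOM check can be sketched in Python using the constants from the standard codecs module. The function name sniff_bom and the (encoding, BOM length) return shape are assumptions made for this example, and the sketch deliberately does nothing more than compare leading bytes.

```python
import codecs

# Order matters: the UTF-32 little-endian BOM (FF FE 00 00) begins with
# the UTF-16 little-endian BOM (FF FE), so UTF-32 is tested first.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes):
    """Return (encoding name, BOM length), or (None, 0) if there is no BOM."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


raw = codecs.BOM_UTF16_LE + "Hello".encode("utf-16-le")
encoding, skip = sniff_bom(raw)
text = raw[skip:].decode(encoding)
print(encoding, text)
```

When no BOM is found, the caller still has to decide on an encoding by other means, such as a declared charset or a default.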

 

Example

Let's consider the word Hello.
Its Unicode code points are:
U+0048   U+0065   U+006C   U+006C   U+006F

Storing this in UTF-8 takes 5 bytes:
48 65 6C 6C 6F

Storing this in UTF-16 big-endian takes 10 bytes, plus 2 bytes for the byte order mark FE FF if a BOM is present:
00 48 00 65 00 6C 00 6C 00 6F

Storing this in UTF-16 little-endian takes 10 bytes, plus 2 bytes for the byte order mark FF FE if a BOM is present:
48 00 65 00 6C 00 6C 00 6F 00

Storing this in UTF-32 big-endian takes 20 bytes, plus 4 bytes for the byte order mark 00 00 FE FF if a BOM is present:
00 00 00 48 00 00 00 65 00 00 00 6C 00 00 00 6C 00 00 00 6F

Storing this in UTF-32 little-endian takes 20 bytes, plus 4 bytes for the byte order mark FF FE 00 00 if a BOM is present:
48 00 00 00 65 00 00 00 6C 00 00 00 6C 00 00 00 6F 00 00 00
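These byte counts can be checked with a few lines of Python. This is a sketch using Python's built-in codecs; the variable names are illustrative. The codec names with an explicit -be or -le suffix write no BOM, while the generic "utf-16" and "utf-32" codecs prepend one in the machine's native byte order.

```python
word = "Hello"

sizes = {}
for name in ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"):
    data = word.encode(name)
    sizes[name] = len(data)
    print(f"{name:10} {len(data):2} bytes  {data.hex(' ').upper()}")

# The generic codecs add a BOM: 2 extra bytes for UTF-16, 4 for UTF-32.
with_bom_16 = len(word.encode("utf-16"))
with_bom_32 = len(word.encode("utf-32"))
print("utf-16 with BOM:", with_bom_16, "bytes")
print("utf-32 with BOM:", with_bom_32, "bytes")
```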

 


 

 
