Why Unicode is amazing - Oliver Booth

Lately I've found myself engrossed in the inner workings of Unicode and its associated UTF encodings. Today I'm going to talk about them, and I hope to get you as excited about them as I am, because frankly the solutions that the Unicode Consortium has come up with are genius.

Note

I'm aware that Tom Scott starred in a Computerphile video about UTF-8. While it's a perfectly good video and I recommend you watch it, I'm going to be going into a lot more detail than he did.

Also, I'm going to be talking about UTF-16 and UTF-32, which weren't covered in that video. So enjoy!

The history of ASCII

To cut a long story short: different countries landed on different ways to encode text because different languages require different characters. Because they're different. Initially, this wasn't much of a problem, as sharing files between different computers - especially ones located across the globe - was very uncommon in the days before the web.

The US and the rest of the English-speaking world settled on a standard known as the American Standard Code for Information Interchange, also known as ASCII. ASCII is a 7-bit encoding, meaning that it uses 7 bits to represent each character. With 7 bits, there are 128 possible characters that can be represented.

The first 32 characters are non-printable control characters, which include things like backspace, newline, and I swear to god, a character named BEL to make the computer beep or otherwise make its native notification sound.

Of the remaining 96 characters, 95 are “printable”: the space, the English alphabet in both uppercase and lowercase, the digits 0-9, and some other useful punctuation and related symbols. The very last one, DEL (127), is one more control character.

ASCII was a perfectly good standard for its day, and for the English-speaking world it pretty much did the job, but the rest of the world was not so fortunate. When 8-bit computing finally hit widespread use, an extended version of ASCII was introduced (creatively named Extended ASCII).

Extended ASCII used the 8th bit to add an additional 128 characters, but the problem was that there were many different versions of Extended ASCII in use by different countries, and they were all entirely incompatible with each other. And when you look at how other languages that use non-Latin characters (such as Japanese, Korean, Arabic, Hebrew) encoded their characters, it was a complete mess. There was no standard way to encode text.

Enter Unicode

Unicode was developed to solve the problem of having multiple incompatible character encodings. The folks at the Unicode Consortium set out to create a standard that would be able to represent every character in every language in the world, and it even includes characters from languages that we don't even know how to translate.

It's often believed that Unicode and UTF are the same thing, but it's important to note that Unicode is not an encoding itself. Unicode does not specify how these characters are represented in memory. It's simply a standard that defines a mapping of each and every character to a unique number. This number is known as a code point. For example:

The character E is mapped to the number 69, or U+0045.
The character ඞ is mapped to the number 3486, or U+0D9E.
The character 🤣 is mapped to the code point 129315, or U+1F923.

Yes, emojis are part of Unicode!

What's the U+xxxx notation?

This is nothing more than simple convention. Unicode uses this so that people know when you're talking about a Unicode codepoint, rather than a codepoint in another character set.

The digits after the + represent the hexadecimal value of the code point. So U+0045 just means 45₁₆, which is the hexadecimal representation of the number 69₁₀.
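If you want to see the notation fall out of the numbers, here's a minimal sketch. ToUPlus is a name I've made up for illustration; char.ConvertToUtf32 is the built-in .NET method for reading a code point out of a string. The emoji is written as an escape so the snippet doesn't depend on how the source file is saved.

```csharp
using System;

// Hypothetical helper: formats a code point in U+XXXX notation,
// padded to at least four hexadecimal digits.
static string ToUPlus(int codePoint) => $"U+{codePoint:X4}";

// char.ConvertToUtf32 returns the code point found at the given index of a string.
int letterE = char.ConvertToUtf32("E", 0);
int rofl = char.ConvertToUtf32("\U0001F923", 0); // the 🤣 emoji

Console.WriteLine($"{letterE} -> {ToUPlus(letterE)}"); // 69 -> U+0045
Console.WriteLine($"{rofl} -> {ToUPlus(rofl)}");       // 129315 -> U+1F923
```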

UTF-8

UTF-8 is a variable-length encoding that was designed to be backwards compatible with ASCII. It's by far the most common encoding used on the web today, and it's the default for source files and text I/O in many modern programming languages and tools.

So how do we encode a given Unicode character's codepoint? Well, let's start with an easy one. We'll take the letter E, with a Unicode code point of U+0045.

The first thing we do is check if the codepoint fits inside a single UTF-8 byte. See, UTF-8 encodes each character in as few bytes as it can, starting at 1 byte (8 bits, hence the name UTF-8), but the most significant bit (MSB) has a reserved meaning which we'll talk about shortly. Below is a table of how many bytes are required to encode a given code point:

Code point range	Byte length
U+0000 - U+007F	1
U+0080 - U+07FF	2
U+0800 - U+FFFF	3
U+10000 - U+10FFFF	4

Thankfully, 69 is indeed less than 128, so in this case we just shove that value as the byte and call it a day. So the letter E is encoded as 0x45. Job done.
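If you'd rather read the table as code, here's a rough sketch of the same lookup. Utf8ByteLength is a hypothetical name of my own, and the range check at the top is an assumption I've added (Unicode stops at U+10FFFF).

```csharp
using System;

// Hypothetical helper mirroring the table above:
// how many UTF-8 bytes a given code point needs.
static int Utf8ByteLength(int codePoint)
{
    // Assumption: anything outside U+0000 - U+10FFFF is rejected.
    if (codePoint < 0 || codePoint > 0x10FFFF)
    {
        throw new ArgumentOutOfRangeException(nameof(codePoint));
    }

    if (codePoint <= 0x7F) return 1;   // U+0000 - U+007F
    if (codePoint <= 0x7FF) return 2;  // U+0080 - U+07FF
    if (codePoint <= 0xFFFF) return 3; // U+0800 - U+FFFF
    return 4;                          // U+10000 - U+10FFFF
}

Console.WriteLine(Utf8ByteLength(0x45));    // 1
Console.WriteLine(Utf8ByteLength(0x0D9E));  // 3
Console.WriteLine(Utf8ByteLength(0x1F923)); // 4
```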

Encoding multi-byte characters

Now let's try another! Let's take the letter é, with a Unicode code point of U+00E9. This one is a little bit more complicated because E9₁₆ is greater than 7F₁₆, so how do we proceed?

First we need to get the binary representation of the code point. In this case, it's 11101001₂. Remember this number, because we're about to split its bits across two bytes.

Next, we use the table above to see that it's in the range U+0080 - U+07FF, so we know that it's going to be a 2-byte sequence. So let's start out by creating some space for those bytes (using x to represent bits that we don't know yet):

xxxxxxxx xxxxxxxx

For the first byte in the character, we start out by writing as many 1s as there are bytes for the character. In this case, it's 2 bytes, so we simply write two 1s. We then write a 0 to signal the end of the count. So our data so far looks like this:

110xxxxx xxxxxxxx

Each subsequent byte starts with 10 to indicate that it's a continuation byte. So our data now looks like this:

110xxxxx 10xxxxxx

And now we can begin to fill in the xs with our codepoint. Remember that binary representation of the codepoint? It was 11101001₂. So we simply fill in the gaps, starting from the right: the low six bits (101001) go into the second byte, and the two bits left over (11) go at the end of the first byte:

110xxx11 10101001

Finally, any remaining bits are filled in with 0s:

11000011 10101001

Now we have our two bytes! 11000011₂ is C3₁₆, and 10101001₂ is A9₁₆, so our final UTF-8 bytes for the character é are 0xC3 0xA9. If you read past the 110 and 10 headers, the bits that are left (00011 and 101001) are just our original codepoint padded with zeroes.

This also answers the question of what I meant by the most significant bit having a “reserved meaning”. If the first bit is 0, then it's a single-byte character which is compatible with ASCII. If the first bit is 1, then it's part of a multi-byte character. If the second bit is 1, then it's the first byte, and if it's 0 then it's a continuation byte to hold the remaining bits of the character that didn't fit. This means you'll never see a UTF-8 character start with 10 because that's not a valid header for the first byte!
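Those rules about the leading bits can be sketched as a small classifier. DescribeUtf8Byte is a hypothetical helper of mine, and it only looks at the bit pattern: it doesn't try to catch every byte that a strict decoder would reject.

```csharp
using System;

// Hypothetical helper: works out a byte's role in a UTF-8 stream
// purely from its leading bits.
static string DescribeUtf8Byte(byte b)
{
    if ((b & 0b1000_0000) == 0b0000_0000) return "single byte (ASCII)";
    if ((b & 0b1100_0000) == 0b1000_0000) return "continuation byte";
    if ((b & 0b1110_0000) == 0b1100_0000) return "first byte of a 2-byte sequence";
    if ((b & 0b1111_0000) == 0b1110_0000) return "first byte of a 3-byte sequence";
    if ((b & 0b1111_1000) == 0b1111_0000) return "first byte of a 4-byte sequence";
    return "not valid in UTF-8";
}

// 0x45 is E, and 0xC3 0xA9 is the é we encoded above.
foreach (byte value in new byte[] { 0x45, 0xC3, 0xA9 })
{
    Console.WriteLine($"0x{value:X2}: {DescribeUtf8Byte(value)}");
}
// 0x45: single byte (ASCII)
// 0xC3: first byte of a 2-byte sequence
// 0xA9: continuation byte
```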

That might have seemed like a lot, and at first glance it can seem a bit overwhelming. But honestly, this is such a genius solution and it's remarkable how well it works.

Here's an example of a C# implementation for manually encoding a codepoint as UTF-8:

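using System;

byte[] EncodeAsUtf8(int codepoint)
{
    // Reject values outside the Unicode range, and the surrogates
    // (U+D800 - U+DFFF), which are not valid in UTF-8.
    if (codepoint < 0 || codepoint > 0x10FFFF || (codepoint >= 0xD800 && codepoint <= 0xDFFF))
    {
        throw new ArgumentOutOfRangeException(nameof(codepoint));
    }

    // U+0000 - U+007F: a single byte, identical to ASCII
    if (codepoint < 0x80)
    {
        return new byte[] { (byte)codepoint };
    }

    // U+0080 - U+07FF: 110xxxxx 10xxxxxx
    if (codepoint < 0x800)
    {
        return new byte[]
        {
            (byte)(0xC0 | (codepoint >> 6)),
            (byte)(0x80 | (codepoint & 0x3F))
        };
    }

    // U+0800 - U+FFFF: 1110xxxx 10xxxxxx 10xxxxxx
    if (codepoint < 0x10000)
    {
        return new byte[]
        {
            (byte)(0xE0 | (codepoint >> 12)),
            (byte)(0x80 | ((codepoint >> 6) & 0x3F)),
            (byte)(0x80 | (codepoint & 0x3F))
        };
    }

    // U+10000 - U+10FFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return new byte[]
    {
        (byte)(0xF0 | (codepoint >> 18)),
        (byte)(0x80 | ((codepoint >> 12) & 0x3F)),
        (byte)(0x80 | ((codepoint >> 6) & 0x3F)),
        (byte)(0x80 | (codepoint & 0x3F))
    };
}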

You can play around with this here!
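Going the other way is a nice sanity check. Here's a rough sketch of a decoder, assuming the input is already well-formed UTF-8. DecodeFirstUtf8 is a hypothetical name, and it deliberately skips the validation a real decoder needs (checking continuation bytes, rejecting overlong forms). Encoding.UTF8 and char.ConvertFromUtf32 are the built-in .NET APIs, used here to produce bytes to decode.

```csharp
using System;
using System.Text;

// Hypothetical helper: decodes the first code point from well-formed UTF-8 bytes.
// A sketch only: continuation bytes and overlong forms are not validated.
static int DecodeFirstUtf8(byte[] bytes)
{
    byte first = bytes[0];
    int extra;     // how many continuation bytes follow
    int codePoint; // payload bits taken from the first byte

    if (first < 0x80)                { extra = 0; codePoint = first; }
    else if ((first & 0xE0) == 0xC0) { extra = 1; codePoint = first & 0x1F; }
    else if ((first & 0xF0) == 0xE0) { extra = 2; codePoint = first & 0x0F; }
    else if ((first & 0xF8) == 0xF0) { extra = 3; codePoint = first & 0x07; }
    else throw new ArgumentException("Not a valid first byte.", nameof(bytes));

    for (int i = 1; i <= extra; i++)
    {
        // Each continuation byte contributes its low six bits.
        codePoint = (codePoint << 6) | (bytes[i] & 0x3F);
    }

    return codePoint;
}

// Let .NET encode é (U+00E9) for us, then decode it by hand.
byte[] encoded = Encoding.UTF8.GetBytes(char.ConvertFromUtf32(0xE9));
Console.WriteLine(BitConverter.ToString(encoded));     // C3-A9
Console.WriteLine($"U+{DecodeFirstUtf8(encoded):X4}"); // U+00E9
```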

UTF-16

UTF-8 encodes characters into at least 1 byte (8 bits), and so it reasonably follows that UTF-16 encodes characters into at least 2 bytes (16 bits)! See? The names are starting to make sense now.

UTF-16 is actually significantly simpler. And by simpler I mean it involves fewer steps, but truth be told it actually involves a little bit more math.

For all characters that fit inside 2 bytes (i.e. any code point up to and including U+FFFF), UTF-16 simply uses that value padded with zeroes. So for example, the letter E, which has the codepoint U+0045, is simply encoded as 0x0045. The letter ç, which has the codepoint U+00E7, is encoded as 0x00E7. This range of code points is known as the Basic Multilingual Plane (BMP). Not all characters fit into the BMP, though. Once you start getting into the higher codepoints, you need to use a technique known as surrogate pairs. The values 0xD800 to 0xDFFF are deliberately left unassigned in the BMP, so a surrogate can never be mistaken for an ordinary character. This is where things get a little bit more complicated.

Let's use the character 🜊 to demonstrate how this works. This character is named “Alchemical Symbol For Vinegar” and has the codepoint U+1F70A. This is obviously greater than U+FFFF, so how do we encode it?

First, we need to subtract 0x10000 from the codepoint. This gives us 0xF70A. Next we create the first (“high”) surrogate by bit-shifting the value right 10 places and adding 0xD800.

0xF70A >> 10  = 0x3D
0x3D + 0xD800 = 0xD83D

For the second (“low”) surrogate, we take the lower 10 bits and add 0xDC00.

0xF70A & 0x3FF = 0x30A   // 0x3FF is 0b1111111111 in binary, 10 bits
0x30A + 0xDC00 = 0xDF0A

Cramming these two values together gives us 0xD83D 0xDF0A, which is the UTF-16 encoding for the character 🜊.
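The same arithmetic can be sketched in a few lines, along with the reverse trip. ToSurrogatePair and FromSurrogatePair are hypothetical names of mine, and they assume the input really is above U+FFFF. Since .NET strings are UTF-16 under the hood, char.ConvertFromUtf32 gives us something to check the result against.

```csharp
using System;

// Hypothetical helper: splits a code point above U+FFFF into a surrogate pair.
static (int High, int Low) ToSurrogatePair(int codePoint)
{
    int offset = codePoint - 0x10000;    // leaves a 20-bit value
    int high = 0xD800 + (offset >> 10);  // top 10 bits
    int low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits
    return (high, low);
}

// Hypothetical helper: the same steps in reverse.
static int FromSurrogatePair(int high, int low)
{
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
}

var (hi, lo) = ToSurrogatePair(0x1F70A);
Console.WriteLine($"0x{hi:X4} 0x{lo:X4}");            // 0xD83D 0xDF0A
Console.WriteLine($"U+{FromSurrogatePair(hi, lo):X}"); // U+1F70A

// Cross-check: the two chars of the .NET string are the two surrogates.
string vinegar = char.ConvertFromUtf32(0x1F70A);
Console.WriteLine(vinegar[0] == hi && vinegar[1] == lo); // True
```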

Here's an example of a C# implementation for manually encoding a codepoint as UTF-16:

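using System;

byte[] EncodeAsUtf16(int codepoint)
{
    // Reject values outside the Unicode range, and lone surrogates
    // (U+D800 - U+DFFF), which aren't characters in their own right.
    if (codepoint < 0 || codepoint > 0x10FFFF || (codepoint >= 0xD800 && codepoint <= 0xDFFF))
    {
        throw new ArgumentOutOfRangeException(nameof(codepoint));
    }

    byte[] utf16Bytes;

    if (codepoint <= 0xFFFF)
    {
        // code point fits in 16 bits, so it's a BMP character
        // (bytes are written low byte first, i.e. little-endian)
        utf16Bytes = new byte[2];
        utf16Bytes[0] = (byte)codepoint;
        utf16Bytes[1] = (byte)(codepoint >> 8);
    }
    else
    {
        // code point is outside the BMP, so it's a surrogate pair
        int codepointPrime = codepoint - 0x10000;
        int highSurrogate = ((codepointPrime >> 10) & 0x3FF) + 0xD800;
        int lowSurrogate = (codepointPrime & 0x3FF) + 0xDC00;

        // high surrogate first, then low; each one low byte first
        utf16Bytes = new byte[4];
        utf16Bytes[0] = (byte)highSurrogate;
        utf16Bytes[1] = (byte)(highSurrogate >> 8);
        utf16Bytes[2] = (byte)lowSurrogate;
        utf16Bytes[3] = (byte)(lowSurrogate >> 8);
    }

    return utf16Bytes;
}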

You can play around with this one here! The bytes come out low byte first (little-endian), so they'll look reversed compared with the values above: 0xD83D is written as 3D D8. That's just because of endianness, and I'll be going into detail about it in a future post.

UTF-32

UTF-32 is the simplest of all the UTF encodings. It simply encodes each codepoint as a 32-bit integer! There are no special cases, no bit-shifting, no nothing. You just take the codepoint as a 32-bit integer and call it a day. So the letter E is encoded as 0x00000045, the letter ç is encoded as 0x000000E7, and the character 🜊 is encoded as 0x0001F70A.
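For completeness, here's a minimal sketch of that. EncodeAsUtf32BigEndian is a hypothetical name, and writing the most significant byte first is my own choice so that the output reads like the values above; real data may use either byte order.

```csharp
using System;

// Hypothetical helper: writes a code point as four bytes,
// most significant byte first (big-endian).
static byte[] EncodeAsUtf32BigEndian(int codePoint)
{
    return new byte[]
    {
        (byte)((codePoint >> 24) & 0xFF),
        (byte)((codePoint >> 16) & 0xFF),
        (byte)((codePoint >> 8) & 0xFF),
        (byte)(codePoint & 0xFF)
    };
}

Console.WriteLine(BitConverter.ToString(EncodeAsUtf32BigEndian(0x45)));    // 00-00-00-45
Console.WriteLine(BitConverter.ToString(EncodeAsUtf32BigEndian(0x1F70A))); // 00-01-F7-0A
```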

Conclusion

I hope you enjoyed this post! I know it was a bit of a long one, but I hope you found it interesting because I know I did. Unicode is such a fascinating and genius solution, and one that so many people take for granted.

Until next time!