"Why is this night different from all other nights?" "Why do good things happen to bad people?" "What is UTF-8?" For centuries, people have been asking fundamental questions about the nature of the universe and the purpose of life. Here we address the third question, "What is UTF-8?" In The Beginning, there was ASCII. ASCII is both a character<->number mapping *and* an encoding system for representing those numbers. Note that these two things can be decoupled, it just happens that ASCII does not decouple them. (In a moment, we'll meet a system that does decouple them.) In ASCII, a single space character has the number 32, *and* is represented by an eight-bit byte 00100000. The letter 'A' is 65 (01000001). Etc, etc. In classical ASCII, the eighth bit is never set. ASCII values only go up to 127 (01111111), which is the DEL character. Note, while we're here, that all the "control characters" in ASCII are <= 127. They all live in 7-bit ASCII. Originally, the eighth bit was used by some communications protocols, since it was unused as far as ASCII encoding was concerned. Later these protocols got sane and started using out-of-band data for flow control, and the eighth bit became available for encoding purposes. This was good timing, because Europe badly needed some more characters, for vowels with umlauts and stuff. So various European conventions arose in the upper ASCII regions. These are the ISO-8859 or "Latin" family, the most common being ISO-8859-1 (Latin-1). Naturally, these could represent only 127 more unique characters than lower ASCII. Even this was not enough to cover all European languages, let alone world languages, however. Enter Unicode. Unicode made an interesting decision. They said "We want to be able to encode an arbitrary number of characters." So they decoupled numeric value from representation. In Unicode, each character has a number (called a "Unicode code point"). How you represent that number depends on which encoding of Unicode you use. The number *could* be 57879302763479238192374 -- there's no reason Unicode couldn't have a character with that value. Of course, they haven't gotten that high yet, but you never know. Anyway, in Unicode, all the interesting action happens in the representations, a.k.a. the encodings. One simple encoding is UTF-32, which uses 4 bytes for every character. Sometimes it is called "UCS-4" instead of "UTF-32". "UCS" gives the number of bytes, whereas "UTF" gives the number of bits. I'll stick with "UTF" in this document. UTF-32 is a fixed-width encoding, because it has enough bits per char to represent all Unicode code points (they're up to 1114111 as of this writing). "Fixed-width" means if you have a char array, you can jump to the N'th char in constant time, because you know exactly how far it is. In UTF-32, you'd jump 4*N bytes into the array. The two-byte encoding UTF-16 is in more common use than UTF-32 these days, however. UTF-16 is still fixed-width for most practical purposes: since almost all characters in all modern languages are below 65535, they can fit in just two bytes. But both UTF-32 and UTF-16 have some problems: 1) They waste a lot of space if you're just writing English text, because you use many more bits per char as you really need. 2) They're not compatible with a lot of existing code, because they have all-zero-bit bytes, which look like end-of-string markers to C, for example. Enter UTF-8, "The Beautiful Encoding". UTF-8 works like this: Characters can occupy anywhere from 1 to 4 bytes. 
That's right, it's not fixed width.  You can't instantly jump to the Nth char in an array -- they just gave up on that.  You can jump to the Nth *byte* in an array, but that byte might be in the middle of a character, and you wouldn't even know how many complete characters you'd skipped over to get there (though you would at least have upper and lower bounds on that number).

In UTF-8, all classical (7-bit) ASCII chars are represented as themselves.  If you see a byte in a UTF-8 string, and its 8th bit is zero, then you can just interpret it as that ASCII char.  You don't need to know anything else about the context.  The bytes on either side of it do not matter a whit.  Note that this works smoothly because the Unicode code points for these characters match their ASCII values.  That is, ASCII 0-127 match Unicode 0-127 in their *values*.  (It would be meaningless to make that claim about their *representations*, of course, since Unicode itself does not specify the representation.  But UTF-8 does specify the representation.)

For all Unicode code points from 128 to 2047, UTF-8 uses two bytes.  The first byte has its upper two bits (8th and 7th) set to one, the 6th bit set to zero, and then the remaining 5 bits hold part of the character value.  The second byte has its 8th bit set but its 7th bit clear, and the remaining 6 bits hold the rest of the value.

I could tell you the rules for a three-byte character, but you can already see what's going on here.  In UTF-8, the rules for any given byte are consistent:

  1) If the 8th bit is clear, it's a one-byte char; see ASCII above.

  2) If the 8th and 7th bits are set, it's the first byte of a multibyte character.

  3) If the 8th bit is set but the 7th is clear, it's a continuation (i.e., non-leading) byte in a multibyte character.

I won't bother you with the complete rules for all multibyte chars, but the pattern is: if the 8th, 7th, and 6th bits are set and the 5th is clear, it's the first byte of a three-byte character, and so on.  In other words, the number of bits set immediately after the 8th indicates how many continuation bytes the character has.  The original design allowed sequences of up to 6 bytes, but in practice the upper end of that range is unused, and the standard now limits UTF-8 to 4 bytes, which is enough to reach the highest code point.

The important thing is, you can pick up a UTF-8 stream in the middle and get your bearings immediately.  If you pick up the stream in the middle of a character, you won't know what that character was, of course, but you will be able to tell that you were in the middle of *some* character, because you'll recognize case (3) above, and therefore you can reset your stream at the start of the next character, which you will be able to unambiguously recognize according to the above rules.

Note also that UTF-8 doesn't have any all-zero bytes, except for ASCII NUL (Ctrl-@), which, of course, could just as easily appear in an ASCII string as in a UTF-8 string.  So UTF-8 is as compatible with C string functions as ASCII is.

Of course, UTF-8 doesn't encode, say, Chinese as space-efficiently as UTF-16 or UTF-32 does.  It uses up extra bits on meta-data, so it requires three bytes for every Chinese character instead of two.  I say who cares.  It's still an amazing compatibility win.

So, now you know why UTF-8 is the wonderfulest thing ever.  See Markus Kuhn's excellent "Unicode and UTF-8 FAQ for Unix/Linux" at http://www.cl.cam.ac.uk/~mgk25/unicode.html for more.

What does any of this have to do with XML?  Nothing really, except that it's easy to get confused about what's legal in one versus what's legal in another.
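(Before getting to that, here is a small sketch of the byte rules described above, again in Python.  The helper names, encode_two_byte and classify, are made up for this note; a real program would just call a library encoder.)

  def encode_two_byte(cp):
      # Code points 128-2047 become two bytes: 110xxxxx 10xxxxxx.
      assert 0x80 <= cp <= 0x7FF
      first  = 0b11000000 | (cp >> 6)            # top 5 bits of the value
      second = 0b10000000 | (cp & 0b00111111)    # low 6 bits of the value
      return bytes([first, second])

  print(encode_two_byte(0xE9))                   # b'\xc3\xa9'
  print("\u00e9".encode("utf-8"))                # same thing, via the real encoder

  def classify(byte):
      # Rules (1)-(3) above, applied to one byte in isolation.
      if byte & 0b10000000 == 0:
          return "one-byte char (ASCII)"
      elif byte & 0b01000000:
          return "first byte of a multibyte char"
      else:
          return "continuation byte"

  # Pick up a stream in the middle (starting at byte 1, inside a character)
  # and get your bearings: skip continuation bytes until you hit a byte
  # that starts a character.
  stream = "\u4e2d\u6587".encode("utf-8")        # e4 b8 ad  e6 96 87
  for i in range(1, len(stream)):
      print(i, hex(stream[i]), classify(stream[i]))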
In UTF-8, *all* the ASCII control chars are legal.  In XML, the only legal control characters are LF, CR, and TAB.  If you want other stuff in XML, you have to have some higher-up encoding scheme, like Base64, going on (there's a tiny sketch of this at the end of this note).  See http://greenbytes.de/tech/webdav/rfc3470.html#binary for details.  XML 1.0 (and, I believe, 1.1) does use UTF-8 as its official default character encoding.  But it really supports only a subset of UTF-8: all non-control chars, plus LF, CR, and TAB.  The other control chars cannot be represented directly in an XML document.

-- kfogel, 2004-12-01

Further reading:

* http://kunststube.net/encoding/ is a great article.  I found it on 2015-06-27 and highly recommend it.  It gives a more technically detailed discussion of character encodings, starting from the very basics and going into the details of the various Unicode encodings, and it even discusses some specific application behaviors and how to avoid common misunderstandings and problems when handling encodings.

* "UTF-8 bit by bit" (https://wiki.tcl-lang.org/page/UTF%2D8+bit+by+bit) by Richard Suchenwirth is a wonderfully clear, detailed explanation of the actual bit-level structure of UTF-8 strings.  It refers out to "Unicode and UTF-8" (https://wiki.tcl-lang.org/page/Unicode+and+UTF%2D8) and to the rather informative "The history of UTF-8 as told by Rob Pike" (http://doc.cat-v.org/bell_labs/utf-8_history).

* https://sethmlarson.dev/blog/utf-8 is a very good blog post about how UTF-8 works, with illustrations.
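Appendix: the tiny sketch promised above, in Python again.  It is only a sketch: the check below looks only at the control-character range discussed in this note, and ignores the other ranges XML 1.0 excludes (such as unpaired surrogates).

  import base64

  def xml_can_carry(ch):
      # LF, CR, and TAB are fine; other control characters are not.
      cp = ord(ch)
      return cp in (0x09, 0x0A, 0x0D) or cp >= 0x20

  data = "tab\tis fine, but BEL\x07is not"
  print([c for c in data if not xml_can_carry(c)])       # ['\x07']

  # Legal UTF-8, illegal XML -- so wrap the raw bytes in Base64 and
  # unwrap them on the other side.
  wrapped = base64.b64encode(data.encode("utf-8")).decode("ascii")
  print(wrapped)
  print(base64.b64decode(wrapped).decode("utf-8") == data)   # True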