"Why is this night different from all other nights?" "Why do good things happen to bad people?" "What is UTF-8?" For centuries, people have been asking fundamental questions about the nature of the universe and the purpose of life. Here we address the third question, "What is UTF-8?" In The Beginning, there was ASCII. ASCII is both a character<->number mapping *and* an encoding system for representing those numbers. Note that these two things can be decoupled, it just happens that ASCII does not decouple them. (In a moment, we'll meet a system that does decouple them.) In ASCII, a single space character has the number 32, *and* is represented by an eight-bit byte 00100000. The letter 'A' is 65 (01000001). Etc, etc. In classical ASCII, the eighth bit is never set. ASCII values only go up to 127 (01111111), which is the DEL character. Note, while we're here, that all the "control characters" in ASCII are <= 127. They all live in 7-bit ASCII. Originally, the eighth bit was used by some communications protocols, since it was unused as far as ASCII encoding was concerned. Later these protocols got sane and started using out-of-band data for flow control, and the eighth bit became available for encoding purposes. This was good timing, because Europe badly needed some more characters, for vowels with umlauts and stuff. So various European conventions arose in the upper ASCII regions. These are the ISO-8859 or "Latin" family, the most common being ISO-8859-1 (Latin-1). Naturally, these could represent only 127 more unique characters than lower ASCII. Even this was not enough to cover all European languages, let alone world languages, however. Enter Unicode. Unicode made an interesting decision. They said "We want to be able to encode an arbitrary number of characters." So they decoupled numeric value from representation. In Unicode, each character has a number (called a "Unicode code point"). How you represent that number depends on which encoding of Unicode you use. The number *could* be 57879302763479238192374 -- there's no reason Unicode couldn't have a character with that value. Of course, they haven't gotten that high yet, but you never know. Anyway, in Unicode, all the interesting action happens in the representations, a.k.a. the encodings. One simple encoding is UTF-32, which uses 4 bytes for every character. Sometimes it is called "UCS-4" instead of "UTF-32". "UCS" gives the number of bytes, whereas "UTF" gives the number of bits. I'll stick with "UTF" in this document. UTF-32 is a fixed-width encoding, because it has enough bits per char to represent all Unicode code points (they're up to 1114111 as of this writing). "Fixed-width" means if you have a char array, you can jump to the N'th char in constant time, because you know exactly how far it is. In UTF-32, you'd jump 4*N bytes into the array. The two-byte encoding UTF-16 is in more common use than UTF-32 these days, however. UTF-16 is still fixed-width for most practical purposes: since almost all characters in all modern languages are below 65535, they can fit in just two bytes. But both UTF-32 and UTF-16 have some problems: 1) They waste a lot of space if you're just writing English text, because you use many more bits per char as you really need. 2) They're not compatible with a lot of existing code, because they have all-zero-bit bytes, which look like end-of-string markers to C, for example. Enter UTF-8, "The Beautiful Encoding". UTF-8 works like this: Characters can occupy anywhere from 1 to 4 bytes. 
That's right, it's not fixed width.  You can't instantly jump to the Nth char in an array -- they just gave up on that.  You can jump to the Nth *byte* in an array, but that byte might be in the middle of a character, and you wouldn't even know how many complete characters you'd skipped over to get there (though you would at least have upper and lower bounds on that number).

In UTF-8, all classical (7-bit) ASCII chars are represented as themselves.  If you see a byte in a UTF-8 string, and its 8th bit is zero, then you can just interpret it as that ASCII char.  You don't need to know anything else about the context.  The bytes on either side of it do not matter a whit.  Note that this works smoothly because the Unicode code points for these characters match their ASCII values.  That is, ASCII 0-127 match Unicode 0-127 in their *values*.  (It would be meaningless to make that claim about their *representations*, of course, since Unicode itself does not specify the representation.  But UTF-8 does specify the representation.)

For all Unicode code points from 128 to 2047, UTF-8 uses two bytes.  The first byte has its upper two bits (8th and 7th) set to one, the 6th bit set to zero, and then the remaining 5 bits hold part of the character value.  The second byte has its 8th bit set but its 7th bit clear, and the remaining 6 bits hold the rest of the value.

I could tell you the rules for a three-byte character, but you can already see what's going on here.  In UTF-8, the rules for any given byte are consistent:

  1) If the 8th bit is clear, it's a one-byte char; see ASCII above.

  2) If the 8th and 7th bits are set, it's the first byte of a multibyte character.

  3) If the 8th bit is set but the 7th is clear, it's a continuation (i.e., non-leading) byte in a multibyte character.

I won't bother you with the complete rules for all multibyte chars, but the pattern is: if the 8th, 7th, and 6th bits are set and the 5th is clear, it's the first byte of a three-byte character, and so on.  In other words, the number of bits set immediately after the 8th indicates how many continuation bytes the character has.  The original design allowed sequences of up to 6 bytes, but in practice the upper end of that range is unused, and the standard now limits UTF-8 to 4 bytes, which is enough to reach the highest code point.

The important thing is, you can pick up a UTF-8 stream in the middle and get your bearings immediately.  If you pick up the stream in the middle of a character, you won't know what that character was, of course, but you will be able to tell that you were in the middle of *some* character, because you'll recognize case (3) above, and therefore you can reset your stream at the start of the next character, which you will be able to unambiguously recognize according to the above rules.

Note also that UTF-8 doesn't have any all-zero bytes, except for ASCII NUL (Ctrl-@), which, of course, could just as easily appear in an ASCII string as in a UTF-8 string.  So UTF-8 is as compatible with C string functions as ASCII is.

Of course, UTF-8 doesn't encode, say, Chinese as space-efficiently as UTF-16 or UTF-32 does.  It uses up extra bits on meta-data, so it requires three bytes for every Chinese character instead of two.  I say who cares.  It's still an amazing compatibility win.

So, now you know why UTF-8 is the wonderfulest thing ever.  See Markus Kuhn's excellent "Unicode and UTF-8 FAQ for Unix/Linux" at http://www.cl.cam.ac.uk/~mgk25/unicode.html for more.

What does any of this have to do with XML?  Nothing really, except that it's easy to get confused about what's legal in one versus what's legal in another.
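(Before getting to that, here is a small sketch of the byte rules described above, again in Python.  The helper names, encode_two_byte and classify, are made up for this note; a real program would just call a library encoder.)

  def encode_two_byte(cp):
      # Code points 128-2047 become two bytes: 110xxxxx 10xxxxxx.
      assert 0x80 <= cp <= 0x7FF
      first  = 0b11000000 | (cp >> 6)            # top 5 bits of the value
      second = 0b10000000 | (cp & 0b00111111)    # low 6 bits of the value
      return bytes([first, second])

  print(encode_two_byte(0xE9))                   # b'\xc3\xa9'
  print("\u00e9".encode("utf-8"))                # same thing, via the real encoder

  def classify(byte):
      # Rules (1)-(3) above, applied to one byte in isolation.
      if byte & 0b10000000 == 0:
          return "one-byte char (ASCII)"
      elif byte & 0b01000000:
          return "first byte of a multibyte char"
      else:
          return "continuation byte"

  # Pick up a stream in the middle (starting at byte 1, inside a character)
  # and get your bearings: skip continuation bytes until you hit a byte
  # that starts a character.
  stream = "\u4e2d\u6587".encode("utf-8")        # e4 b8 ad  e6 96 87
  for i in range(1, len(stream)):
      print(i, hex(stream[i]), classify(stream[i]))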
In UTF-8, *all* the ASCII control chars are legal.  In XML, the only legal control characters are LF, CR, and TAB.  If you want other stuff in XML, you have to have some higher-up encoding scheme, like Base64, going on (there's a tiny sketch of this at the end of this note).  See http://greenbytes.de/tech/webdav/rfc3470.html#binary for details.  XML 1.0 (and, I believe, 1.1) does use UTF-8 as its official default character encoding.  But it really supports only a subset of UTF-8: all non-control chars, plus LF, CR, and TAB.  The other control chars cannot be represented directly in an XML document.

-- kfogel, 2004-12-01

Further reading:

* http://kunststube.net/encoding/ is a great article.  I found it on 2015-06-27 and highly recommend it.  It gives a more technically detailed discussion of character encodings, starting from the very basics and going into the details of the various Unicode encodings, and it even discusses some specific application behaviors and how to avoid common misunderstandings and problems when handling encodings.

* "UTF-8 bit by bit" (https://wiki.tcl-lang.org/page/UTF%2D8+bit+by+bit) by Richard Suchenwirth is a wonderfully clear, detailed explanation of the actual bit-level structure of UTF-8 strings.  It refers out to "Unicode and UTF-8" (https://wiki.tcl-lang.org/page/Unicode+and+UTF%2D8) and to the rather informative "The history of UTF-8 as told by Rob Pike" (http://doc.cat-v.org/bell_labs/utf-8_history).

* https://sethmlarson.dev/blog/utf-8 is a very good blog post about how UTF-8 works, with illustrations.
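Appendix: the tiny sketch promised above, in Python again.  It is only a sketch: the check below looks only at the control-character range discussed in this note, and ignores the other ranges XML 1.0 excludes (such as unpaired surrogates).

  import base64

  def xml_can_carry(ch):
      # LF, CR, and TAB are fine; other control characters are not.
      cp = ord(ch)
      return cp in (0x09, 0x0A, 0x0D) or cp >= 0x20

  data = "tab\tis fine, but BEL\x07is not"
  print([c for c in data if not xml_can_carry(c)])       # ['\x07']

  # Legal UTF-8, illegal XML -- so wrap the raw bytes in Base64 and
  # unwrap them on the other side.
  wrapped = base64.b64encode(data.encode("utf-8")).decode("ascii")
  print(wrapped)
  print(base64.b64decode(wrapped).decode("utf-8") == data)   # True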