(Work in Progress -- please feel free to finish this and mail it to me)

ASCII takes up a little more space than it needs to, because every
character is the same "bit"-length, whether it's 1010101 or 0000001.
Huffman encoding makes a prefix-unique character set in which not all
characters are the same length, and makes sure that the most frequently
used characters get the shortest codes.  It's not the absolute
winningest method of compression, but it's sure one of the cutest, and
it's often used in combination with other methods in compression
programs.

Note that one couldn't really use Huffman-encoded character sets as
active data -- you'd have no way to jump to character N, because first
you'd need to know the lengths of all the intermediate characters!
It's useful for saving space in inert files, though.

What does "prefix-unique" mean?  Let's assume a simplified alphabet of
8 characters (so that it could be represented in 3 bits in normal
ASCII).  In order of frequency, they are A, B, C, D, E, F, G, H
(A is the most common, H the least, that is).  Now look at the
following tree:

          _______________________
         |                       |
         0                       1
         |                       |
    -----------             -----------
    |         |             |         |
   0:A       1:B            0         1
                           /|         |\
                          / |         | \
                         /  |         |  \
                       0:C 1:D        0   1
                                     /|   |\
                                    / |   | \
                                   /  |   |  \
                                 0:E 1:F 0:G 1:H

The leaves of the tree are letters in the alphabet, but the tree is not
balanced.  Some characters require only two bits (A and B, for
example), others make do with 3 (C and D), and some actually require 4
bits (E, F, G, and H).  But the system is unambiguous: the string
010100111101101 translates to BBAHBD.  One simply starts at the
beginning of the string and at the root of the tree, follows branches
of the tree according to the bits in the string until reaching a leaf,
writes down that leaf's letter, and starts over at the root of the tree
(while preserving one's place in the bit string), until the end of the
string is reached.

The question is, are we really winning?  Well, the string BBAHBD in a
standard 3-bit code would have required 6 * 3 = 18 bits to represent;
we managed to do it in 2 + 2 + 2 + 4 + 2 + 3 = 15 with Huffman
encoding.  However, the space savings in Huffman encoding is entirely
dependent on statistics -- on the usage-frequency count of the alphabet
being encoded.  If we had to encode this string:

    HHHHHH

we'd need 6 * 4 = 24 bits Huffman-encoded, but only 18 in standard
fixed-length code!  We're counting on the fact that strings like HHHHHH
don't occur as often as ones like BBAHBD, which is more in accord with
our frequency estimate.

There is an algorithm for forming the best Huffman "tree" given an
alphabet and the usage frequencies of its symbols, and in the long run
the Huffman-encoded alphabet will never do worse than the fixed-length
alphabet, and will usually do better.  Here is the algorithm:

[insert 8 oz of pure, godlike, Huffman genius here]
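(Until somebody mails in the genius, here is the standard construction:
start with one free-standing node per character, each weighted by that
character's usage frequency.  Repeatedly take the two nodes with the
smallest weights and hang them under a new parent node whose weight is
the sum of theirs; the parent goes back into the pool in their place.
When only one node is left, it is the root of the tree.  Label every
left branch 0 and every right branch 1, and each character's code is
the path from the root down to its leaf.  Rare characters get merged
early and sink deep into the tree, earning long codes; common
characters survive until late, stay near the root, and earn short
codes.)

Here is that construction sketched in Python.  The usage counts are
made up -- just one set of weights that reproduces the code lengths in
the tree above -- and the exact 0/1 labels that come out depend on how
ties are broken, but the code lengths (and therefore the bit counts)
match the example.

    import heapq

    # Hypothetical usage counts, most common (A) to least common (H).
    # Any counts with roughly this ranking give a tree of the same shape.
    freqs = {'A': 20, 'B': 17, 'C': 8, 'D': 7,
             'E': 3,  'F': 3,  'G': 2, 'H': 2}

    def build_codes(freqs):
        """Huffman's algorithm: repeatedly merge the two lightest nodes."""
        # Heap entries are (weight, tiebreak, tree); a tree is either a
        # single character or a (left, right) pair of subtrees.
        heap = [(w, i, ch) for i, (ch, w) in enumerate(freqs.items())]
        heapq.heapify(heap)
        tiebreak = len(heap)
        while len(heap) > 1:
            w1, _, t1 = heapq.heappop(heap)     # lightest node
            w2, _, t2 = heapq.heappop(heap)     # second-lightest node
            heapq.heappush(heap, (w1 + w2, tiebreak, (t1, t2)))
            tiebreak += 1
        codes = {}
        def walk(tree, path):                   # left edge = 0, right = 1
            if isinstance(tree, str):
                codes[tree] = path
            else:
                walk(tree[0], path + '0')
                walk(tree[1], path + '1')
        walk(heap[0][2], '')
        return codes

    codes = build_codes(freqs)

    def encode(text):
        return ''.join(codes[ch] for ch in text)

    def decode(bits):
        # Prefix-uniqueness at work: no code is a prefix of another, so
        # the first complete code we accumulate is always the right one.
        by_code = {code: ch for ch, code in codes.items()}
        out, cur = [], ''
        for bit in bits:
            cur += bit
            if cur in by_code:
                out.append(by_code[cur])
                cur = ''
        return ''.join(out)

    print(codes)                        # exact labels may vary
    print(len(encode('BBAHBD')))        # 15, vs 18 fixed-length
    print(decode(encode('BBAHBD')))     # BBAHBD -- it round-trips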
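As a sanity check on the "long run" claim: with the made-up counts
above, the tree costs (20+17)*2 + (8+7)*3 + (3+3+2+2)*4 = 159 bits for
62 characters, about 2.6 bits per character against a flat 3 for the
fixed-length code.  And the greedy merge really does build the best
possible prefix tree: the two rarest characters may as well be siblings
at the deepest level (swapping anything shallower down into their place
could only lengthen the total), and once they are merged into one node,
the same argument applies to the one-smaller alphabet.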