-*- mode: text -*-

Minor Text, Locale, and Unicode

This file explains how Minor represents characters and strings
internally (it uses Unicode), how Minor relates its representation to
that used by C programs (wide characters and multi-byte strings), and
how Minor respects the user's locale.

(The behaviors described here are not implemented yet.  For the time
being, Minor assumes that the multi-byte character encoding for C and
the selected locale is UTF-8, and that wide character values are
Unicode scalar values.  This is a bug, and we would welcome help in
implementing the behavior described here.)

* Minor uses Unicode because we want to support multilingual applications.

I would like Minor to help users write multilingual applications:
applications that handle text containing a mixture of languages.  This
entails a lot of different things, but the foundation is a
representation for text that allows one to mix scripts.  Unicode is
the obvious choice: it certainly has the most support in surrounding
systems (XML; Gnome; etc.), and I'm not aware of alternatives that are
as thoroughly fleshed out.

* Minor uses Unicode to represent characters and strings.

In the Minor runtime itself, characters, like #\a or #\newline, are
represented as Unicode code points.  Strings, like "shoestring
histrionics", are represented as arrays of Unicode code points.

* Minor I/O ports carry byte streams; Minor I/O functions use UTF-8.

To borrow text from the MzScheme documentation: by definition, ports
in Minor produce and consume bytes.  When a port is provided to a
character-based operation, such as read, the port's bytes are read and
interpreted as a UTF-8 encoding of characters.  Byte-based operations,
like read-bytes, work in the natural way.

(This means that a byte-based input operation could partially consume
a multi-byte UTF-8 character, and that byte-based output operations
can produce byte sequences that are not valid UTF-8 encodings.)

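The caveat about byte-based operations can be seen at the byte level.
The sketch below is written in Python, not Minor, since the Minor port
API is not specified in this file.  An io.BytesIO object stands in for
a port, and every name in it (port, is_valid_utf8, and so on) is an
illustrative assumption rather than part of Minor.

```python
import io

# 'ï' is U+00EF, which UTF-8 encodes as the two bytes C3 AF.
text = "naïve"
data = text.encode("utf-8")

# Stand-in for a byte-carrying port (an assumption, not Minor's API).
port = io.BytesIO(data)

# A byte-based read of three bytes stops in the middle of 'ï'.
first_three = port.read(3)
rest = port.read()


def is_valid_utf8(chunk):
    """Return True if chunk decodes cleanly as UTF-8."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Neither half is valid UTF-8 on its own; only the rejoined bytes are.
halves_valid = (is_valid_utf8(first_three), is_valid_utf8(rest))
whole_valid = is_valid_utf8(first_three + rest)
```

Here the first read ends on the lead byte of 'ï' and the second begins
on its continuation byte, so a character-based operation applied to
either fragment alone would see a malformed sequence.
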
Note that ASCII text (containing only characters with codes from zero
to 127) is represented by exactly the same sequence of bytes in UTF-8.
UTF-8 uses byte values 128 and above only to encode non-ASCII
characters.

* Minor translates between the locale's multi-byte encoding and UTF-8.

Input ports created in the ordinary way translate text coming from
outside sources from the current locale's encoding to UTF-8.  Output
ports do the reverse translation, from UTF-8 to the current locale's
encoding.  (Network ports are not translated by default.)

Thus, programs written in Minor see streams of UTF-8 data appearing on
input ports, and should write multi-byte characters to output ports
using UTF-8, regardless of the locale.

Minor also provides ways to produce untranslated ports, yielding input
and output ports that carry bytes untranslated from and to the outside
world.  Minor's character-based functions don't change their behavior
when applied to such ports --- they still use UTF-8 --- so
character-based functions should be applied to an untranslated port
only where it's known that the port actually carries UTF-8 text.

There are a few other cases where Minor automatically translates
between the current locale's encoding and UTF-8; Minor generally
follows MzScheme's lead here.

* This is a bit better than MzScheme's default behavior.

MzScheme behaves as described here --- ports carry streams of bytes,
character-based functions assume UTF-8 --- except that ports are not
translated by default.  The user must explicitly request translation
between UTF-8 and the current locale's encoding.

Applications should respect the current locale by default.  While a
system administrator can set a default locale for a system, the user
can override that setting, so the current locale should be treated as
a deliberate request from the user.

* This is no more restrictive than ISO C's behavior.

The ISO C standard I/O library divides its input and output functions
into two categories: the "wide character I/O functions", which operate
on characters, and the "byte I/O functions", which operate on bytes.
ISO C does not allow programs to mix wide character and byte
operations on the same stream: the first operation performed on a
stream sets its "orientation", and all subsequent operations must be
of the same orientation.

Minor's interface has essentially the same restrictions.  Translated
ports are created by applying a translation to a port carrying bytes
directly to or from the outside world.

- On input, the translation process promises to consume no bytes from
  the underlying port until the first translated byte is requested of
  it.  This means that, if the user is not interested in translation,
  and has not consumed any bytes from the translated port, the
  underlying port is still available, undisturbed.

- On output, the translated port promises that, when flushed, a byte
  written to the underlying port will immediately follow the last
  translated byte.

This is roughly equivalent to ISO C's behavior: using the translated
port is equivalent to using wide character functions on a C stream;
using the untranslated port is equivalent to using the byte functions.
However, Minor Scheme is more flexible than ISO C, in that the
underlying port is still available for use --- even if its state is
only well-defined at certain points.
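
The input-side promise can be sketched as follows.  This is a minimal
illustration in Python, not Minor's implementation: the class
TranslatedInput, its read method, and the choice of ISO 8859-1
("latin-1") as the locale encoding are all assumptions made for the
example.

```python
import codecs
import io


class TranslatedInput:
    """Hypothetical translated input port: locale bytes in, UTF-8 out."""

    def __init__(self, underlying, locale_encoding):
        # Creating the translated port reads nothing from the
        # underlying port.
        self.underlying = underlying
        # An incremental decoder tolerates multi-byte characters that
        # are split across separate reads.
        self.decoder = codecs.getincrementaldecoder(locale_encoding)()

    def read(self, n):
        """Consume up to n raw bytes; return their UTF-8 translation."""
        raw = self.underlying.read(n)
        text = self.decoder.decode(raw, final=(raw == b""))
        return text.encode("utf-8")


# "café" in ISO 8859-1 is four bytes; 'é' is the single byte E9.
underlying = io.BytesIO("café".encode("latin-1"))
port = TranslatedInput(underlying, "latin-1")

# Nothing has been requested yet, so the underlying port is untouched.
consumed_before = underlying.tell()

# The first request for translated bytes is what consumes raw input.
translated = port.read(4)
consumed_after = underlying.tell()
```

Because the constructor only stores its arguments, a caller who
decides not to use the translated port can still read the underlying
port from its original position.  The output-side promise (flushing
before handing control back to the underlying port) is not shown.
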