-*- mode: text -*-

		   Minor Text, Locale, and Unicode


This file explains how Minor represents characters and strings
internally (it uses Unicode), how Minor relates its representation to
that used by C programs (wide characters and multi-byte strings), and
how Minor respects the user's locale.

(The behaviors described here are not implemented yet.  For the time
being, Minor assumes that the multi-byte character encoding for C and
the selected locale are UTF-8, and that wide character values are
Unicode scalar values.  This is a bug, and we would welcome help in
implementing the behavior described here.)


* Minor uses Unicode because we want to support multilingual applications.

I would like Minor to help users write multilingual applications:
applications that handle text containing a mixture of languages.  This
entails a lot of different things, but the foundation is a
representation for text that allows one to mix scripts.  Unicode is
the obvious choice: it certainly has the most support in surrounding
systems (XML; Gnome; etc.), and I'm not aware of alternatives that are
as thoroughly fleshed out.


* Minor uses Unicode to represent characters and strings.

In the Minor runtime itself, characters, like #\a or #\newline, are
represented as Unicode code points.  Strings, like "shoestring
histrionics", are represented as arrays of Unicode code points.
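
As a rough illustration only (written in Python, not Minor, and not a
description of Minor's actual data layout), a string can be viewed as
an array of code points, one integer per character.  The helper name
code_points is made up for this sketch.

```python
# Illustration in Python, not Minor: view a string as an array of
# Unicode code points, one integer per character.

def code_points(text):
    """Return the list of Unicode code points that make up text."""
    return [ord(ch) for ch in text]

# Plain ASCII letters have code points below 128.
ascii_points = code_points("shoe")

# Mixed scripts: Latin "a", Greek lambda, and a CJK ideograph.
mixed_points = code_points("a\u03bb\u4e2d")

print(ascii_points)
print([hex(p) for p in mixed_points])
```

Each element is a whole character, whatever script it comes from; no
byte-level encoding is visible at this level.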


* Minor I/O ports carry byte streams; Minor I/O functions use UTF-8.

To borrow text from the MzScheme documentation: by definition, ports
in Minor produce and consume bytes.  When a port is provided to a
character-based operation, such as read, the port's bytes are read and
interpreted as a UTF-8 encoding of characters.  Byte-based operations,
like read-bytes, work in the natural way.  (This means that a
byte-based input operation could partially consume a multi-byte UTF-8
character, and that byte-based output operations can produce byte
sequences that are not valid UTF-8 encodings.)
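
The parenthetical above can be seen in a small Python sketch.  An
io.BytesIO object stands in for a Minor byte port here; that stand-in
and the helper is_valid_utf8 are assumptions of the example, not part
of Minor.

```python
import io

def is_valid_utf8(data):
    """Return True if data is a complete, valid UTF-8 byte sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# U+00E9 (e with acute accent) is two bytes in UTF-8: 0xC3 0xA9.
port = io.BytesIO("\u00e9!".encode("utf-8"))

first = port.read(1)   # byte-based read: takes only the lead byte 0xC3
rest = port.read()     # what a later character-based read would be left with

print(first, is_valid_utf8(first))
print(rest, is_valid_utf8(rest))
print(is_valid_utf8(first + rest))
```

Neither half is valid UTF-8 on its own, although the two halves
together are.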

Note that ASCII text (containing only characters with codes from zero
to 127) is represented in UTF-8 by exactly the same sequence of bytes
as in ASCII.  UTF-8 uses byte values 128 and above only for non-ASCII
characters.
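
A short Python check of both properties (illustrative only; the
variable names are made up for this sketch):

```python
# ASCII text encodes to the same bytes under ASCII and under UTF-8.
ascii_text = "plain ASCII text\n"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# Every byte used to encode a non-ASCII character is 128 or above,
# so such bytes can never be mistaken for ASCII characters.
non_ascii = "\u00e9\u4e2d".encode("utf-8")
all_high = all(byte >= 128 for byte in non_ascii)

print(same_bytes, all_high)
```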


* Minor translates between the locale's multi-byte encoding and UTF-8.

Input ports created in the ordinary way translate text coming from
outside sources from the current locale's encoding to UTF-8.  Output
ports do the reverse translation, from UTF-8 to the current locale's
encoding.  (Network ports are not translated by default.)  Thus,
programs written in Minor see streams of UTF-8 data appearing on input
ports, and should write multi-byte characters to output ports using
UTF-8, regardless of the locale.
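
The translation in both directions might be sketched as follows, in
Python rather than Minor.  The names translated_input and
translated_output are hypothetical, io.BytesIO stands in for a port,
and ISO-8859-1 is simply assumed as the locale's encoding rather than
queried from the environment.

```python
import codecs
import io

LOCALE_ENCODING = "iso-8859-1"   # assumption: stand-in for the user's locale

def translated_input(raw_port, encoding=LOCALE_ENCODING, chunk_size=4096):
    """Yield UTF-8 chunks for bytes read from raw_port in `encoding`."""
    # An incremental decoder copes with characters split across chunks.
    decoder = codecs.getincrementaldecoder(encoding)()
    while True:
        chunk = raw_port.read(chunk_size)
        text = decoder.decode(chunk, final=not chunk)
        if text:
            yield text.encode("utf-8")
        if not chunk:
            break

def translated_output(raw_port, utf8_bytes, encoding=LOCALE_ENCODING):
    """Write UTF-8 data from the program to raw_port in `encoding`."""
    raw_port.write(utf8_bytes.decode("utf-8").encode(encoding))

# Input: the outside world supplies Latin-1; the program sees UTF-8.
outside = io.BytesIO(b"caf\xe9\n")
seen_by_program = b"".join(translated_input(outside))

# Output: the program writes UTF-8; the outside world receives Latin-1.
sink = io.BytesIO()
translated_output(sink, seen_by_program)

print(seen_by_program, sink.getvalue())
```

The program-facing side is UTF-8 in both directions; only the bytes
exchanged with the outside world depend on the locale.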

Minor also provides ways to produce untranslated ports, yielding input
and output ports that carry bytes untranslated from and to the outside
world.  Minor's character-based functions don't change their behavior
when applied to such ports --- they still use UTF-8 --- so they should
only be used where it's known that the port actually carries UTF-8
text.
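
To see why, consider this Python sketch, in which the hypothetical
helper read_chars plays the role of a character-based operation that
always assumes UTF-8.  How Minor itself reports malformed input is not
specified here; Python happens to raise an exception.

```python
def read_chars(raw_bytes):
    """Interpret raw port bytes as UTF-8, as a character operation would."""
    return raw_bytes.decode("utf-8")

# An untranslated port that really carries UTF-8 works as expected.
utf8_ok = read_chars("caf\u00e9".encode("utf-8"))

# An untranslated port carrying Latin-1 does not: the byte 0xE9 is not
# a complete UTF-8 sequence.
try:
    read_chars("caf\u00e9".encode("iso-8859-1"))
    latin1_failed = False
except UnicodeDecodeError:
    latin1_failed = True

print(utf8_ok == "caf\u00e9", latin1_failed)
```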

There are a few other cases where Minor automatically translates
between the current locale's encoding and UTF-8; Minor generally
follows MzScheme's lead here.


* This is a bit better than MzScheme's default behavior.

MzScheme behaves as described here --- ports carry streams of bytes,
character-based functions assume UTF-8 --- except that ports are not
translated by default.  The user must explicitly request translation
between UTF-8 and the current locale's encoding.

Applications should respect the current locale by default.  While a
system administrator can set a default locale for a system, the user
can override that setting, so the current locale should be treated as
a deliberate request from the user.


* This is no more restrictive than ISO C's behavior.

The ISO C standard I/O library divides its input and output functions
into two categories: the "wide character I/O functions", which operate
on characters, and the "byte I/O functions", which operate on bytes.
ISO C does not allow programs to mix wide character and byte
operations on the same stream: the first operation performed on a
stream sets its "orientation", and all subsequent operations must be
of the same orientation.

Minor's interface has essentially the same restrictions.  Translated
ports are created by applying a translation to a port carrying bytes
directly to or from the outside world.

- On input, the translation process promises to consume no bytes from
  the underlying port until the first translated byte is requested of
  it.  This means that, if the user is not interested in translation,
  and has not consumed any bytes from the translated port, the
  underlying port is still available, undisturbed.

- On output, the translated port promises that, when flushed, a byte
  written to the underlying port will immediately follow the last
  translated byte.

This is roughly equivalent to ISO C's behavior: using the translated
port is equivalent to using wide character functions on a C stream;
using the untranslated port is equivalent to using the byte functions.
However, Minor Scheme is more flexible than ISO C, in that the
underlying port is still available for use --- even if its state is
only well-defined at certain points.
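
The two promises made by translated ports can be sketched in Python.
This is only a model under stated assumptions, not Minor's
implementation: the classes LazyTranslatedInput and TranslatedOutput
are hypothetical, io.BytesIO stands in for the underlying port, and
ISO-8859-1 stands in for the locale's encoding.

```python
import codecs
import io

class LazyTranslatedInput:
    """Hypothetical input translation that reads nothing until asked."""

    def __init__(self, underlying, encoding):
        self.underlying = underlying
        self.decoder = codecs.getincrementaldecoder(encoding)()
        # Deliberately no read-ahead: the underlying port is untouched.

    def read(self):
        """Translate everything left on the underlying port to UTF-8."""
        data = self.underlying.read()
        return self.decoder.decode(data, final=True).encode("utf-8")

class TranslatedOutput:
    """Hypothetical output translation that buffers until flushed."""

    def __init__(self, underlying, encoding):
        self.underlying = underlying
        self.encoding = encoding
        self.pending = []

    def write(self, utf8_bytes):
        self.pending.append(utf8_bytes)

    def flush(self):
        # After this, nothing translated remains buffered, so a raw
        # byte written next lands directly after the translated data.
        text = b"".join(self.pending).decode("utf-8")
        self.underlying.write(text.encode(self.encoding))
        self.pending.clear()

# Input promise: wrapping alone consumes no bytes.
source = io.BytesIO(b"caf\xe9")
translated = LazyTranslatedInput(source, "iso-8859-1")
offset_after_wrapping = source.tell()
body = translated.read()

# Output promise: a raw byte written after a flush follows immediately.
sink = io.BytesIO()
out = TranslatedOutput(sink, "iso-8859-1")
out.write("caf\u00e9".encode("utf-8"))
out.flush()
sink.write(b"\xff")

print(offset_after_wrapping, body, sink.getvalue())
```

Before the flush, the underlying output port's contents say nothing
reliable about what has been written to the translated port; that is
the sense in which its state is only well-defined at certain points.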