The Design of Minor Scheme                                  -*- mode: text -*-
==========================

Minor Scheme is an implementation of the Scheme programming language.  Its features include:

- good performance, via:
  + just-in-time and ahead-of-time compilers generating native machine code, and
  + precise, generational garbage collection,
- full support for native system threads,
- full source-level debugging,
- a C API and a foreign function interface, for interfacing with C code,
- a module system which supports separate compilation and cross-compilation, and
- a hygienic macro system.

Minor currently requires:

- Either:
  + An IA-32 family processor, 80486 or later (the 80386 does not guarantee atomicity of word stores in an SMP configuration), or
  + An AMD64 family processor.
- POSIX (IEEE Std 1003.1, 2003 Edition), with the following optional functionality (marked as indicated in the margins of the spec):
  + X/Open System Interface Extensions ("XSI") (the sigaction interface, in particular)
  + Threads ("THR")
  + Semaphores ("SEM")
- A version of GNU CC implementing ISO C (ISO/IEC 9899:1999 (E), commonly called "C99").
- Versions of GNU CC, the GNU Binutils, and the GNU C library that support the '__thread' thread-local storage class.

I'm certainly interested in patches needed to make Minor work with other versions of this software.  As far as portability is concerned, I'm interested in changing Minor's interfaces to system-specific facilities as needed to make Minor easier to port, but beyond that, I want to invest my own time in supporting the machines I actually develop on.  I'll leave writing code for other systems to interested parties.

Minor is being developed on the following two systems:

- A Pentium II running Fedora Core 4
- An AMD64 running Fedora Core 4

Fedora Core 4 uses:

- Linux 2.6.14
- GNU libc 2.3.3
- GCC 4.0.1
- Binutils 2.15.94.0.2.2

The packages in Fedora Core generally include patches against their upstream sources.


* Minor's Goals

** Basic Requirements

Here are the basic requirements and goals for Minor Scheme.  These are all supposed to be characteristics apparent to the user, not strategies for providing such characteristics (e.g., instead of "generates native code", we would say "provides good performance").

- Minor Scheme should provide hygienic macros that can do arbitrary computation, and a Flatt-style module system that supports predictable separate compilation in the presence of macros.
- Minor Scheme should support the full Scheme numeric tower, including arbitrary-precision integers and rational numbers.
- Minor Scheme should support a full set of continuation operators, including Queinnec's control operators and one-shot continuations.
- Minor Scheme should use modern Unicode (i.e., including those characters with codes above 64k) to represent strings.  It should provide a separate 'byte array' type for working with things other than text.
- Minor Scheme should provide full source-position tracking, both for compile-time and run-time errors.  This should include tracking code introduced by macro expansions.
- Minor Scheme should support source-level debugging.
- Minor Scheme should have a good interactive user interface, with legible, helpful error messages.
- Minor Scheme should be easy to use as a script interpreter.
- Minor Scheme should allow the use of native system threads, on both uniprocessor and multiprocessor systems.
- Minor Scheme should provide a thread-aware C API, to allow C code to call Scheme code.
  The API's design should not preclude the use of copying, generational, incremental, or concurrent garbage collection.
- Minor Scheme should provide a full foreign function interface, to allow Scheme code to call C code.
- Minor Scheme should have a full interface to POSIX system calls, Linux system calls, and other essential system libraries.
- Minor Scheme should fit well into other build systems: the ahead-of-time compiler should produce .o files that the user can link into their programs or shared libraries in the normal way, perhaps with reference to some run-time Minor libraries as well.
- Minor Scheme should facilitate the distribution of Scheme libraries separately from the Minor Scheme distribution itself.
- It should be possible to build Minor Scheme from source with the standard 'configure; make' sequence.  The distribution should include no binaries or other object files; every file should be a source file.
- Minor Scheme should have good performance.  It should compare well with Bigloo, Chez, Chicken, MIT Scheme, MzScheme, RScheme, and SCM --- both in interactive performance, and in the speed of pre-compiled code.

** Secondary Goals

Once the basic requirements have been met, here are other goals that are necessary for a 1.0 release:

- Minor Scheme should have an interface to the Gtk 2.0 graphical user interface toolkit.
- Minor Scheme should have equivalents for 'lex' and 'yacc', for constructing custom lexers and parsers.
- Minor Scheme should have basic support for parsing and generating XML.
- Minor Scheme should have a full, efficient regular expression library, supporting both greedy and non-greedy operators.  Repetition operators should be capable of returning lists of match values, not just a fixed set.

** Basic Non-Requirements

Here are considerations that are not important for Minor Scheme:

- Minor Scheme need not avoid taking advantage of system-specific facilities.  It is not meant to run out-of-the-box on any ISO C, POSIX, etc. system.
- Minor Scheme need not support LinuxThreads.  It may require the new "Native POSIX Threads Library" thread system, which properly implements the POSIX 'pthread' interface.

* Tactics

Here are the specific approaches we'll use to achieve the goals above.

** GC, then interpreter, then compiler

We'll produce Minor in the following steps:

- Write the GC first.  SMP-safe and everything.

- Many of the functions in the C API are just constructors and accessors, and are trivial given a GC.  Those are next.  Obviously, 'mn_eval' is not in this category.

- Implement an interpreter in C for 'core Scheme', the language produced by macro expansion.  Write this on top of the C API; since the API allows you to define Scheme procedures in C, this will be indistinguishable at the user level from the native-code implementation.  (I have this feeling modules are involved here.  What do the free variables in a macro-expanded program refer to?)

- Implement, in core Scheme, the macro expander and module system.  Effectively, this implements full Scheme by translation into core Scheme.  Extend the C runtime as necessary, but try to write as much as possible in Scheme, since the performance of the interpreter is not important, and the Scheme code can be shared with later stages.

- Implement an assembler.  This has two parts:

  - an arch-independent part: just a library for putting together byte strings that contain references to labels (blocks, labels, relocs, annotations, etc.)
  - an arch-dependent part: this emits machine-code representations of some architecture's instructions, using the arch-independent facilities.

  The result of the assembly process is a set of blocks of bytes, each annotated with the labels it defines, the labels it refers to, and relocs saying how to patch the latter's values into the blocks.  This is what the machine code -> procedure API expects; it's also what the ELF reader and writer (see below) operate on.

- Implement machine-code procedures.  This entails designing:

  - representations for machine-code procedures
  - calling/returning conventions for Scheme procedures
  - representations for Scheme continuations
  - representations for C continuations in Scheme
  - code->register use mappings
  - code->heap use mappings
  - code->unwinding info mappings (for backtraces and exceptions)
  - code<->source location mappings (for debugging and error messages)
  - code->environment mappings (for debugging)
  - code->catch block mappings
  - an API to let Scheme code wrap up hunks of machine code as actual procedure objects, annotated as above

  Designing the code annotations will be a challenge: if represented directly, they will be enormous.  Diwan, Moss, and Hudson describe their experiences: they had to go to some length to get the annotations down to 16% of the code size, using a safe-GC-points approach.  How to allow GC to happen at arbitrary instruction boundaries, I don't yet know.

- Implement a JIT compiler for core Scheme in full Scheme.

- Implement separate compilation, so that the standard Unix linker can link separately compiled modules into a running program.  Use ordinary ELF relocatable object files to represent separately compiled modules.

- Build an interpreter-free Scheme executable, by compiling the JIT compiler to ELF files and then linking them in.  Now you have an interactive, native-code, JIT compiler!

- Once the compiler can generate functions directly callable from C, we can re-implement the Minor C API functions in Scheme.  Since code generated by the Minor compiler is annotated code, it won't require synchronization as control enters and leaves incoherent sections, so the small functions (like mn_car and mn_cdr) will probably be faster than the C versions.

- Build all sorts of packages (RPMs, Debian .debs, Windows zip files, and so on) to get a sense of how we can help packaging go smoothly.  However, there should also be a turnkey way to build source Scheme packages --- ./configure && make, perhaps, with additional conventions for where libraries get installed.

** Multi-threaded, generational GC

Minor should use generational, non-incremental garbage collection.  Generational collection reduces both total collection time and typical pause time.  Incremental collection makes the maximum pause time quite short, but typically at the cost of increased overall run time; I don't think it's the right trade for most applications, especially since some non-incremental (but generational) collectors already seem to have almost no noticeable collection pauses.

The collector must support multi-threaded programs.  However, this is _not_ the same as supporting concurrent collection, where one thread can collect while other threads run mutator code.  As far as I can tell, concurrent collection entails the same sorts of performance costs as incremental collection.  Instead, the Minor collector will stop all mutator threads, collect, and then continue.
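As a rough sketch of that stop-the-world discipline (not a committed design; the names mn_gc_safe_point, mn_gc_collect, and the counters below are all hypothetical), the rendezvous between the collecting thread and the registered mutator threads might look something like this:

    /* Hypothetical sketch only: a stop-the-world rendezvous.  Each
       registered mutator thread polls at safe points (allocation is the
       obvious one); whichever thread wants to collect raises a flag,
       waits for the others to park, collects, and releases them.  */

    #include <pthread.h>

    static pthread_mutex_t gc_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t gc_cond = PTHREAD_COND_INITIALIZER;
    static int gc_stop_requested;   /* set while a collection is pending  */
    static int gc_threads_running;  /* registered threads not yet parked  */

    /* Park the calling thread until the pending collection finishes.
       Must be called with gc_lock held.  */
    static void
    gc_park (void)
    {
      gc_threads_running--;
      pthread_cond_broadcast (&gc_cond);
      while (gc_stop_requested)
        pthread_cond_wait (&gc_cond, &gc_lock);
      gc_threads_running++;
    }

    /* Called by mutator threads at each safe point.  */
    void
    mn_gc_safe_point (void)
    {
      pthread_mutex_lock (&gc_lock);
      if (gc_stop_requested)
        gc_park ();
      pthread_mutex_unlock (&gc_lock);
    }

    /* Called by whichever thread decides a collection is needed.  */
    void
    mn_gc_collect (void)
    {
      pthread_mutex_lock (&gc_lock);
      while (gc_stop_requested)     /* someone else is already collecting */
        gc_park ();
      gc_stop_requested = 1;
      gc_threads_running--;         /* this thread is parked, by definition */
      while (gc_threads_running > 0)
        pthread_cond_wait (&gc_cond, &gc_lock);
      /* ... trace the registered threads' roots and collect here ...  */
      gc_threads_running++;
      gc_stop_requested = 0;
      pthread_cond_broadcast (&gc_cond);
      pthread_mutex_unlock (&gc_lock);
    }

A thread would increment gc_threads_running when it registers with the collector and decrement it when it unregisters; the only point of the sketch is that collection begins once every registered thread has parked at a safe point.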
The C API in include/minor/minor.h effectively requires each thread to register itself with the GC --- every function expects an mn_call argument, and call arguments belong only to a specific thread.  This means the collector has the information it needs to find all the threads that might be touching the heap, and stop them if they are.

The first collector will be a copying collector, but on today's machines memory is so slow that non-copying collectors may have the advantage.  The only catch is: how do you represent generations without copying?  If you set aside two bits per object (four generations), that's more than 3% overhead, just for generational information.  Hence the traditional approach of using the area an object lives in to indicate its generation --- which requires copying.

One half-and-half idea would be to have a single bit per object (1.6% overhead) lying around in a table off to the side that could act as the lowest bit of the generation number, so that every even->odd promotion could be done without copying, just by setting the bit.  Every odd->even promotion would actually copy, and clear the bit.  That'd give objects more time to die before we take the time to copy them.

Anyway, once we've got something running, there will be lots of room for improvement.

** Just-in-time compiler

To get competitive performance, Minor Scheme will do just-in-time compilation to native code.  For ideas, we can look at Mono's code, read the papers they refer to, and look at Dybvig's papers about Chez Scheme.

I've heard that Dybvig has the following rule for Chez Scheme: he only includes an optimization if, when the compiler compiles itself, the optimization speeds the compiler up enough to overcome the cost of performing it.  In other words, the compiler generates better code, but never gets slower --- each optimization "pays for itself".  This is a great JIT rule.  However, it does preclude optimizations that would only benefit non-compiler-like code: floating-point unboxing, SIMD detection, and so on.  We'd have to find a different rule to decide whether to include these sorts of things.  The rule only applies to optimizations: I hear that hygienic macros slowed down the compiler a lot --- but they certainly were included.

** Machine code verifier

I don't know whether this is really a good idea.  But, as I've been playing with prologue analyzers in GDB, it seems to me that most compiler-generated machine code is pretty easy to analyze and prove type-correct.  This is especially true for dumb compilers, but it looks to me like it's also true for the output of optimizing compilers.  So it should be possible to have a pass that runs after the compiler and assembler, analyzes the machine code, and verifies that the compiler isn't using uninitialized registers, is untagging and GC-protecting values properly, and so on.  George Necula did something like this for his Proof-Carrying Code system, and he said it was surprisingly helpful at catching compiler bugs: when the compiler generates bad code, it won't type-check.  With my assembler/disassembler stuff (in particular, the insn-case form), I think this might be much easier than one would expect, and very helpful.

Another thought here: rather than propagating detailed GC-tracking information on every node through the compiler, perhaps we could propagate it only on selected nodes (those that become labeled instructions, say), and then re-infer the GC annotations after machine code has been generated.
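To make the annotation problem concrete, here is one possible shape for those per-code GC annotations: a table keyed by instruction-offset ranges, recording which registers and stack slots hold heap references at each point.  This is purely illustrative; none of the type or field names below are settled, and a real table would be packed far more densely.

    /* Hypothetical sketch of per-code GC annotations: for each range of
       instruction offsets within a code block, record which registers
       and stack slots hold live heap references.  A verifier or
       inference pass would reconstruct entries like these from the
       machine code itself.  */

    #include <stddef.h>
    #include <stdint.h>

    struct mn_gc_map_entry
    {
      uint32_t start_offset;        /* half-open range of code offsets      */
      uint32_t end_offset;
      uint32_t live_regs;           /* bitmap: registers holding pointers   */
      uint32_t frame_words;         /* size of the stack frame, in words    */
      const uint8_t *slot_map;      /* bitmap: frame slots holding pointers */
    };

    struct mn_code_annotations
    {
      void *code_start;             /* base address of the code block       */
      size_t code_size;
      size_t n_entries;
      const struct mn_gc_map_entry *entries;  /* sorted by start_offset     */
    };

    /* Find the GC map entry covering a given code address, if any.  */
    const struct mn_gc_map_entry *
    mn_find_gc_map (const struct mn_code_annotations *a, void *pc)
    {
      size_t offset = (size_t) ((char *) pc - (char *) a->code_start);
      size_t i;

      for (i = 0; i < a->n_entries; i++)
        if (a->entries[i].start_offset <= offset
            && offset < a->entries[i].end_offset)
          return &a->entries[i];
      return NULL;
    }

Whether entries like these must cover arbitrary instruction boundaries or only designated safe points is exactly the size trade-off discussed above.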
That GC-annotation inference pass would be the better part of a machine-code type-checker.

** C interface

To make it easier to use Minor Scheme within existing build systems, and with existing tools, the Minor ahead-of-time compiler should produce ordinary ELF relocatable object files (".o files") that the user can link with .o files generated by other compilers to produce working mixed-language executables.  The user will probably need to link against a run-time library, too ("-lminor").  The Minor AOT compiler should also be able to produce header files containing declarations for the functions, variables, and other constructs the Scheme module exports.

For example, the following makefile should work:

    %.o: %.scm
            minor-compile -c $<

    %.h: %.scm
            minor-compile --header $<

    prog: main.o gcd.o
            cc main.o gcd.o -lminor -o prog

    main.o: main.c gcd.h

    gcd.o gcd.h: gcd.scm

Here is main.c:

    #include <stdio.h>
    #include <stdlib.h>
    #include "gcd.h"

    int
    main (int argc, char **argv)
    {
      printf ("%d\n", gcd (atoi (argv[1]), atoi (argv[2])));
      return 0;
    }

Here is gcd.scm:

    (module gcd Minor
      (import c-ffi)

      (define (gcd a b)
        (if (= b 0)
            a
            (gcd b (remainder a b))))

      ;; Export the Scheme function gcd as a C function named "gcd",
      ;; which expects two int arguments, and returns an int.  The
      ;; C function's arguments are passed to the Scheme function in
      ;; the obvious way.
      (export-c-function int "gcd" (int int) (gcd ...)))

The command "minor-compile --header gcd.scm" should produce the file "gcd.h", containing something like the following:

    /* gcd.h --- C functions exported by gcd.scm
       Generated by minor-compile 0.0 as follows:
         minor-compile --header gcd.scm  */
    int gcd (int, int);

The build process should behave like this:

    $ make
    minor-compile --header gcd.scm
    cc -c main.c
    minor-compile -c gcd.scm
    cc main.o gcd.o -lminor -o prog
    $ ./prog 12 16
    4
    $

The 'export-c-function' form takes care of generating code for the C-visible function that converts the incoming arguments to Scheme objects, and the outgoing return value to a C value.  As used here, the wrapper will also make sure Minor has been initialized (i.e., call mn_init), and find an appropriate mn_call object to use for the conversions and the call.  Options to 'export-c-function' not shown here will allow C code to pass an mn_call object explicitly; in that case, the initialization check will be unnecessary.

So, to create a stand-alone executable in Scheme, one could write:

    (module gcd2 minor
      (import c-ffi)

      (define (gcd a b)
        (if (= b 0)
            a
            (gcd b (remainder a b))))

      (define (main argv)
        (display (gcd (string->number (list-ref argv 1))
                      (string->number (list-ref argv 2))))
        (newline))

      ;; Export the Scheme function main as a C function named "main".
      ;; The C function takes two arguments: an int named "argc" and a
      ;; char ** named "argv".  argv is an array whose length is argc,
      ;; where each element is a null-terminated string of chars.
      (export-c-function int "main" ((int argc) ((list c-string argc) argv))
        ;; The C "main" function converts its argv to the natural
        ;; corresponding Scheme type (a list of strings), and passes
        ;; only that list to the Scheme main; argc is unnecessary in
        ;; Scheme.
        (main argv)))

One could compile and run this file as follows:

    $ minor-compile -c gcd2.scm
    $ cc gcd2.o -lminor -o gcd2
    $ ./gcd2 30 40
    10
    $

*** Static Checking / Rechecking

The C interface should be designed to allow drift between underlying interfaces and generated interfaces to be detected at compile-time.
The source code should include a complete description of the generated interface, how its values are produced from the underlying interface's values, etc.  The generation pass should always check this against the underlying code.  On the Scheme side, of course, it can be impossible to tell statically what types are going to be produced, but we should do the best we can with the information we have.

* The Minor ABI

The interface for loading new machine code into a running Minor program, and the conventions for producing ELF object files to be linked against the Minor run-time, are public and documented interfaces.  Minor's Scheme compiler is not privileged code; it is simply another client of those interfaces.  (I'm not sure there will ever be any other clients, but this is still an important interface, and one worth keeping simple and documenting.)

As described above in "C interface", the Minor ahead-of-time compiler translates Scheme modules into ordinary ELF relocatable object files (.o files), which can be linked using the standard system linker to produce a working executable.  The Minor ABI describes the form relocatable object files should have in order to be linked against the Minor runtime, or loaded into an existing Minor session.

* Implementation

** Macros

** Modules

** Compilation

*** Optimizations

*** Code Generation

** Allocation and Garbage Collection

*** Multi-threaded collection

Right now, the Minor garbage collector stops all threads when any thread needs to collect.  However, if each thread allocates in its own nursery, then inter-nursery pointers can only be created by mutations, which we could track.  Then each thread could collect its own nursery without stopping any other threads.

The challenge is allowing other threads to access our nursery while we collect it: normal forwarding pointers, stored in the objects themselves, would look like corruption to other mutator code.  But the nursery can put forwarding pointers outside objects (say, in the word before the object), so the objects remain usable as we scavenge them.  This non-standard representation is easy to deal with:

- Mutators always allocate in the nursery, so mutator code will always construct non-standard objects.
- The collector switches from scanning one generation to another at well-defined points, so we can easily have a custom scanning function for the nursery.

We can initialize the nursery forwarding pointers when the block is allocated, so the mutator can just skip over those words as it allocates: the number of stores needed for an allocation won't increase.  Larger objects will increase the nursery cache footprint, though.

We'll need to put off updating references from other nurseries until after scavenging is complete: until then, the to-space objects aren't ready for use.  And we'll need a write barrier between the completion of scavenging and the first inter-nursery update, so the other mutator threads will see the effects of our scavenging.

This idea seems pretty straightforward; if you know of papers describing it, please let me know.

Actually, this is completely wrong-headed.  But maybe it can be saved, so I'm not going to delete it.

- We can't let other threads use the old copy of our nursery while we collect; what if they mutate the old copy?
- It's not true that inter-nursery references are only created by side effects.  The *first* inter-nursery reference must be created by a side-effect, but once it's there, the other thread can copy the reference all over the place, including to other threads.
Now, the only way a reference to one of thread X's nursery objects can be created is if thread X creates one.  What if we pull other threads into the collection only if they could possibly have seen a reference to one of our nursery objects?

*** Write Barriers

*** Code Annotations

** Object File Handling

*** generic ELF assembler

*** IA-32 assembler

*** DWARF assembler

** Bootstrap interpreter

** Unicode

Using Unicode in strings will require some extra processing time for converting strings to and from Unicode.  However, this seems like a worthwhile price to pay for having a single, powerful string representation within Minor itself.

A more serious drawback is that the conversion into Unicode, and then back to the outside world's representation, may not preserve the string byte-for-byte when the external encoding uses meta-information that isn't retained in Unicode.  Unicode promises a clean round trip for individual characters, not for the meta-information used to represent those characters.  For example, a string using a stateful encoding may include a sequence to switch into one state, but then immediately switch into another state, before producing any actual characters in the first state.  Such a transition has no representation in Unicode; Unicode only represents characters, not state transitions.  So a round trip from a stateful encoding to Unicode and back would drop the first (useless) state transition.  The sequence of actual characters is preserved, but not the exact sequence of bytes used to encode it.

Of course, if the end consumer is only looking at characters, then this loss of information is no problem: Unicode goes to great lengths to preserve them.  But many operating system kernels do not put filenames in any canonical form before looking them up; they simply treat them as opaque byte strings.  Or if the string is to be used as a key in a database, the database might also treat it as an opaque string of bytes.  To preserve strings byte-for-byte in cases like these, Minor developers would need to hold the information in byte vectors, not strings.  But since a developer generally has no idea a priori what character sets his code will be used with, this creates an incentive for developers to simply use byte vectors everywhere, to avoid mangling users' strings.

There are several things to point out about this concern:

- It's a bad idea to use stateful encodings for filenames in the first place: you could end up with any number of filenames in a directory that look identical, differing only in the arrangement of the meta-information in the names.  I suspect that most Unix systems avoid them for this reason; it would be good to ask Handa-san about this.  (For what it's worth, POSIX says that the behavior of the standard utilities (shell commands) and System Interface functions on strings using stateful encodings is implementation-dependent.  The POSIX Character Set Description File format cannot describe such encodings.)

- Any C program which converts 'char' strings to 'wchar_t' strings using the C library's multi-byte string handling functions will have the same behavior: it will preserve the sequence of characters, not the sequence of bytes.

- Java programs will have this problem, too.  People don't mind Java, right?  (It would be good to talk to Tom Tromey about this.)

It's possible to side-step the problem by having two distinct string encodings within Minor: one for strings in the external character set, which are not translated, and one for strings in Unicode.
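Just to make the cost of that side-step concrete, a dual-representation string type might look something like the sketch below (all of the names are illustrative, not a proposed Minor API); every string primitive then has to dispatch on, or convert between, the two representations:

    /* Hypothetical sketch of the "two string representations" idea: a
       string is either untranslated bytes in the external encoding or
       a sequence of Unicode code points, and every primitive has to
       dispatch on which kind it was handed.  Illustrative names only.  */

    #include <stddef.h>
    #include <stdint.h>

    enum mn_string_kind
    {
      MN_STRING_EXTERNAL,       /* raw bytes, never translated */
      MN_STRING_UNICODE         /* Unicode code points         */
    };

    struct mn_string
    {
      enum mn_string_kind kind;
      size_t length;            /* bytes or code points, depending on kind */
      union
      {
        const uint8_t *bytes;   /* MN_STRING_EXTERNAL */
        const uint32_t *chars;  /* MN_STRING_UNICODE  */
      } u;
    };

    /* Even the simplest primitive ends up dispatching on the kind.  */
    uint32_t
    mn_string_ref (const struct mn_string *s, size_t i)
    {
      if (s->kind == MN_STRING_UNICODE)
        return s->u.chars[i];
      /* For external strings, indexing by character may mean decoding
         the bytes, and tracking shift state for stateful encodings;
         treating each byte as a character is only right for
         single-byte encodings.  */
      return s->u.bytes[i];
    }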
But what happens when you append one type of string to the other?  Obviously some conversion happens, but in which direction?  Converting both to Unicode would mean that innocent-looking operations (like appending a "/", say) would result in conversions.  Converting both to the external encoding would require knowing what shift state the external-encoding string ended in, which is a mess.  And library functions like regular expression matchers would need to be duplicated, or make some sort of dynamic decision about how to access string contents.

In this light, I think the benefits of having a single, uniform, powerful internal representation outweigh the drawbacks.

* Infrastructure

** Bug database

** Per-commit regression testing

** Nightly benchmarking

* Readings

Olin Shivers.  Control-Flow Analysis of Higher-Order Languages.  Ph.D. dissertation, Carnegie Mellon University, May 1991.  Technical Report CMU-CS-91-145, School of Computer Science.
ftp://cs.cmu.edu/afs%2Fcs.cmu.edu%2Fuser%2Fshivers%2Flib%2Fpapers/diss.ps.Z

    This one is really essential.  Lambda is so powerful and hairy you can go crazy trying to think of accurate ways to analyze code built from it, but this thesis is a completely legible explanation of an open-ended analysis technique that is clearly correct, because it is simply a conservative approximation of actually interpreting the code.  If you can write an interpreter and be confident it's correct, then you can write a flow analyzer.  Handy caveats included.  It was the inspiration for my new prologue analysis framework in GDB.  I think a lot of different analyses can be subsumed by this one.  I don't see why we really need to use CPS.

Flatt, M.  Composable and compilable macros: You want it when?  In Proceedings of the ACM SIGPLAN International Conference on Functional Programming, 2002.
http://citeseer.nj.nec.com/flatt02composable.html

    This paper is the first one I've seen that really nails why separate compilation can be kind of hairy for languages based on powerful macro systems.  I'm not sure we want to use this macro / module system, but we've got to address the points it raises.

Andre van Tonder.  SRFI-72: Hygienic macros.
http://srfi.schemers.org/srfi-72/srfi-72.html

    Hygienic macros are very hard to explain.  call/cc is too, but after working with it for a while, I feel like I understand it well, and I'm convinced it's the Right Thing.  I've not yet reached the same degree of comfort with hygienic macros.  I see a lot of confusion about them on the mailing lists, so I think I'm not alone.  It all makes me question whether they really are the Right Thing after all --- surely the Right Thing wouldn't be so difficult to think about.  The macro system described in this SRFI is a simplification of the syntax-case-inspired hygienic macro systems popular nowadays, and I think it's worth looking into.

Oscar Waddell.  Extending the Scope of Syntactic Abstraction.  PhD thesis, August 1999.
http://www.cs.indiana.edu/~owaddell/papers/index.html

    This thesis covers a very simple and powerful module system (a variant of which is in Chez Scheme), and an inlining algorithm.  What do these have to do with each other?  The argument is that they complement each other --- you feel more comfortable using powerful macros because you can be confident that you'll still get good code.  It still seems like an odd mix to me.  Be that as it may, the inlining algorithm looks awesome; I definitely want to do this.
    It's very good at deciding what to inline, so that it doesn't bloat up the code much --- actually, it sometimes makes code smaller.  You might think you'd want to do flow analysis before inlining (to see which procedures are being used where), but actually, I think doing an inlining pass *before* flow analysis would be better.  Olin Shivers' thesis actually describes a family of flow analyses that trade accuracy for time / space complexity; family members may differ in the extent to which they distinguish separate calls to the same function.  Now, if you inline a particular call and then do flow analysis, the analysis will obviously treat the inlined call as distinct from other calls to that function.  If you constrain the inliner to expand the code only by a constant factor, then you've told the flow analysis to distinguish certain calls while still running in linear time.  I think.

Olin Shivers' history of T has pointers to a bunch of seminal Scheme papers and dissertations:
http://www.paulgraham.com/thist.html

David F. Bacon, Perry Cheng, and V.T. Rajan.  A Unified Theory of Garbage Collection.
http://www.research.ibm.com/people/d/dfb/publications.html

    Hans Boehm recommended this paper on gclist, expressing some strong opinions about finalization and destruction.  I could really use a unified theory of garbage collection.

Ulrich Drepper.  How to Write Shared Libraries.  2002-11-3, based on a tutorial given at the UKUUG 2002 Conference in Bristol.
http://people.redhat.com/drepper/dsohowto.pdf

    Available from http://people.redhat.com/drepper, which also has slides.

Olin Shivers, James W. Clark, and Roland McGrath.  Atomic heap transactions and fine-grain interrupts.  In Proceedings of the 1999 ACM International Conference on Functional Programming (ICFP), September 1999, Paris, France.
http://www.ai.mit.edu/~shivers/citations.html#heap

Amer Diwan, Eliot Moss, and Richard Hudson.  Compiler support for garbage collection in a statically typed language.

Manuel Serrano and Marc Feeley.  Storage Use Analysis and its Applications.
http://www-sop.inria.fr/mimosa/Manuel.Serrano/publi/sf-icfp96.ps.gz

William D. Clinger.  Proper Tail Recursion and Space Efficiency.
ftp://ftp.ccs.neu.edu/pub/people/will/tail.ps.gz

Germain, Feeley, and Monnier.  Termite: a Lisp for Distributed Computing.
http://lisp-ecoop05.bknr.net/pdf/19654.pdf

    Not sure this is relevant at all, but since we've opened Pandora's little concurrency box already, we might as well let this one out too...

- Mike Ashley's flow analysis paper
- Soft typing
- PLT's successor to soft typing
- SysV ABI (for ELF format)
- Intel manuals
- JNI API
- _Engineering a Simple, Efficient Code Generator Generator_, Fraser, Hanson, and Proebsting
- Mono JIT

* Resources

** benchmarks

http://swissnet.ai.mit.edu/~jaffer/CNS/benchmarks has benchmarks computing the digits of pi and the mean of many pseudo-random numbers.

See also http://mangler.sourceforge.net/benchmarks.html.  The DHist benchmark should be fairly portable, but use the dhist.srfi-0 file from the tarball unless you want to get involved with an S2 installation.  dhist.srfi-0 is a single source that is nearly all R4RS, with just a couple of COND-EXPANDs used to accommodate each individual Scheme system's build/optimization quirks.

** test suites

http://swissnet.ai.mit.edu/ftpdir/scm/r4rstest.scm is an R4RS conformance test for Scheme implementations.

** call/cc

Nils M Holm collected call/cc tests from c.l.s.
** Unicode

*** SRFI-14 support

I don't think any Schemes other than Gauche have full support for SRFI-14's ucs-range->char-set, but those working on Unicode support for new or existing Schemes may want to grab:

    http://synthcode.com/scheme/char-set.scm

which is portable (SRFI-14 only) code defining char-sets for all Unicode scripts and all core Unicode properties.