SOFORTH, a Forth system that will be ready Real Soon Now. The overall memory arrangement looks like this (NOT shown to scale, and addresses grow upward): ---------------- | Text | | Input | | Buffer | |________________| | | | _ Return | | | Data _ | | |_ Stack | | | Stack_| | | | | | | \/ \/ \/ \/ \/ | | | | | | /\ /\ /\ /\ /\ | | | | Dictionary | | | | | ---------------- The data and return stacks grow down toward the dictionary, which grows up. There is no particular place to store variables or constants; they live in the dictionary and share namespace with all other words. The data and return stacks are interleaved, so that they will both be sensitive only to the dictionary size. This implementation assumes 32-bit words. We'll worry about portability later; there should be only a few places where it will matter. The Dictionary. --------------- A Dictionary entry looks like this: ( |--------| = 1 byte ) ___________________________________ | | | end of parameters (0) | |________|________|________|________| | | | ... | |________|________|________|________| | | | parameter field | |________|________|________|________| | | | parameter field | |________|________|________|________| | | | parameter field | |________|________|________|________| | | | code pointer field | |________|________|________|________| | | | L I N K | |________|________|________|________| ... six words omitted, names are 31 bits long... ________ ________ ________ ________ | | | | | | name | name | name | name | |________|________|________|________| | pbit & | | | | | length | name | name | name | |________|________|________|________| (Addresses grow upward in the above diagram.) The address of the entry is considered to be the word whose first byte contains the precedence bit and name length (the first word of the entry). The first bit in that word is the precedence (IMMEDIATE/NON_IMMEDIATE) bit. The remaining 7 bits of its byte are the length of that name. Following this are the characters of the name itself (up to 31 bytes). The characters of the name are in reverse order, for ease of use by C's string functions. Later on, we might have variable length names, but Keep It Simple for now. Next comes the Link to the previous entry (the link is always on the next word-boundary past the name), then the code-pointer, then the parameter fields. The code pointer could point to one of several different kinds of functions: - address_interpreter(): run down the parameter fields doing address interpreter things (more on this later). - a primitive (in which case there are no PFAs). or maybe one of these: - variable_handler(): the first parameter field contains the value to be pushed on the stack, all the code needs to do is push it there and set some counters correctly (more later). - string_handler(): first parameter field contains the length of the string, the remaining ones contain the characters in the string. For now, the code will just print the string; later, we can have operations that actually use/modify the data. This code also has to set some counters correctly. The Address Interpreter. ------------------------ (PC = "program counter" PF = "parameter field") The way the address_interpreter() works is this: The function it's looking at could be a primitive ("+", for example, which adds the top two numbers on the stack), or it could be a word defined in terms of other words. In either case, it records the caller's next PF on the return stack (it obtains this by advancing the PC). It then *sets* the PC to point to the first PF of the word currently being executed, and starts a loop. The loop runs down the PFs, executing them, until it runs across one that's zero, at which point the PC's old value is restored from the return stack and address_interpreter() returns. Numbers are pushed on the stack by code which takes the next PF (obtained by incrementing PC again) and pushes its value, then increments PC again. Interpretation and colon compilation. ------------------------------------ Brodie says on page 284 of Starting Forth: "Here is what INTERPRET does, in plain English: Begin a loop. Within the loop, scan the next word form the input stream and try to look up. If it's defined, execute it, then make sure the stack has not underflowed. (If it has, exit the loop and display an error message.) If it's _not_ defined, try to convert it to a number, leaving it on the stack. Then repeat the loop until the input stream is exhausted. Now let's compare the interpreter's structure with that of the colon compiler: Begin a loop. Within the loop, scan the next word from the input stream and try to look up. If it _is_ defined, then treat it as a word. If the word is immediate, then execute it and make sure the stack has not underflowed. If it is _not_ immediate, emplace its compilation address. If the word is _not_ defined, try to convert it to a number, compiling it as a literal. Then repeat the loop until the input stream is exhausted, or the system state returns to interpreter mode." This all sounds just dandy to me, and so it's what I do. Conditionals. ------------- There are some branching primitives: BRANCH, BEQ, BNZ, and BZE. `IF' makes use of BRANCH and BZE (it could equally well have used BNZ, but the code is slightly smaller this way). An example definition: : bar if baz then ; The PFs for bar look like (starting from the left): _____________ | || | \/ BZE ADDR1 BAZ END_OF_PARAMS Actually, there's a little bit of fudgery that has to happen, because the loop in address_interpreter() will automatically increment the PC (and branching just works by setting the PC, after all). So, to compensate for what we know address_interpreter() is going to do to the PC, the address actually stored would be this: ____ | || | \/ BZE ADDR1 BAZ END_OF_PARAMS So. `IF' is an immediate word which first compiles in a branch instruction, then a blank entry (leaving the address of that entry on the return stack), and compiles in the words up to `THEN'. THEN, which really means `ENDIF' in Forth, is also an immediate word. It gets the blank PFA off the return stack (you remember, the one left by IF), and stores the last PFA, where BAZ is stored in the above example. It doesn't store the PFA containing END_OF_PARAMS because address_interpreter() will automatically increment the PC to that PFA in a moment, and we must compensate for that fact. `ELSE' is just a thief. It steals the IF's PFA off the return stack, stores its own address there, and leaves a different address for THEN. Neither IF nor THEN ever know the difference; in fact, there could be a bunch of ELSE-like entities between an IF and its THEN, usurping addresses heartlessly, and no one would ever be the wiser. : bar if baz else boo then ; The actual addresses would look like this (i.e.: I'm incorporating the aforementioned fudgery into this diagram): _________________ ____ | || | || | \/ | \/ BZE ADDR1 BAZ BRANCH ADDR2 BOO END_OF_PARAMS Compare the compiled code for an IF-THEN sequence with the code for an IF- THEN-ELSE.