Germ Lisp Update

Germ is fitter, healthier, more productive.

Previously…

In a previous post, I introduced Germ Lisp. Germ is a small (less than 2.5KiB) Lisp interpreter – written in x86 assembly – that can assemble itself. At the end of my previous post, I presented three ways I would like to see Germ improve. I’ve made significant progress on making Germ more capable, which is what I’ll present in this post.

Two Germs

In order to keep the original Germ tiny, all of the improvements are behind a feature flag. This means there are two Germs: the original one, now called “Germ0” and the new one, which has stolen the name “Germ”. This allowed me to write whatever new features I wanted without worrying about size.

Kernel improvements

Speaking of new features, there are quite a few! There are also a few bug fixes, but I’m not going to cover them. Here are the splashy new features at the kernel level.

Memory management

Germ now supports allocations of contiguous bytes (“bytes objects”) and allocations of contiguous objects (“structs”). Bytes objects make strings and symbols more efficient, and enable proper bytevectors. Structs enable vectors and records.

Symbol hash tables

Symbols are now hashed when created, and can be used as keys in hash tables. This is used to speed up symbol interning, and also for top-level and module lookups.

Modules

The kernel now supports simple module search, and maintains a “current module”. Much like Guile, the current module starts off as a raw hash table, but it can be replaced with a module object that has its own hash table along with a list of imports.

There is also a new hash table of loaded modules, which can be accessed (and modified) with the primitive loaded-modules.

Syntax

I added support for two new keywords: set! and @@. There’s nothing special about set!: it works as you would expect. The @@ keyword allows module search. It takes a module name and variable name, and looks up the variable in the module (making use of the list of loaded modules).
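Here’s what those look like in practice (the module name and variable name in the @@ example are made up for illustration):

```scheme
;; set! works as usual:
(define counter 0)
(set! counter (+ counter 1))

;; @@ reaches into another module by name; (my module) and
;; internal-helper are hypothetical:
(@@ (my module) internal-helper)
```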

Multiple values

The kernel provides call-with-values and values, both of which run with very little overhead. If the producer returns a single value normally, an argument list has to be allocated for that one value. If it calls values instead, all the kernel does is pop the consumer procedure off the stack and pass the arguments given to values directly to it.
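The fast path looks like this in plain Scheme – the producer hands its values straight to the consumer without an intermediate list:

```scheme
(call-with-values
 (lambda () (values 1 2 3))   ; producer: calls values directly
 (lambda (a b c) (+ a b c)))  ; consumer: receives them as arguments
;; ⇒ 6
```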

Composable continuations

Rather than using the traditional call-with-current-continuation, I opted for “composable” (or “delimited”) continuations. I followed SRFI 226 (Control Features) for the interface, and wrote primitives for these mouthfuls: call-with-continuation-prompt, abort-current-continuation, and call-with-composable-continuation. These allow for all kinds of fun, but so far I’ve been using them for error handling.
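For a taste of the error-handling use, here is a sketch using the SRFI 226 names (the try and fail helpers, the tag, and the message are my own illustration, not Germ’s actual error-handling code):

```scheme
(define err-tag (make-continuation-prompt-tag 'error))

(define (try thunk handler)
  ;; run THUNK under a prompt; aborts to err-tag land in HANDLER
  (call-with-continuation-prompt thunk err-tag handler))

(define (fail msg)
  ;; discard the continuation up to the prompt and invoke the handler
  (abort-current-continuation err-tag msg))

(try (lambda () (+ 1 (fail "oops")))
     (lambda (msg) (string-append "caught: " msg)))
;; ⇒ "caught: oops"
```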

Dynamic guards

I wrote a special primitive to support dynamic-wind. It’s called call-with-dynamic-guards, and it works just like dynamic-wind, but doesn’t call the pre-thunk or post-thunk in normal circumstances (only during stack winding and unwinding). It’s much easier to implement without the extra calls – remember this is all in assembly! Germ uses dynamic-wind for parameterize. I did flirt with continuation marks (see SRFI 226) for a while, but decided against them for now (they are good, just not required for this project).
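A hypothetical sketch of the layering (the primitive name comes from above, but the exact calling convention is my assumption): in the normal case, Lisp code calls the pre- and post-thunks itself, and the primitive only invokes them when a continuation winds or unwinds the stack.

```scheme
(define (dynamic-wind before thunk after)
  (before)                       ; normal entry: call the pre-thunk ourselves
  (call-with-values
      (lambda ()
        ;; the primitive runs THUNK, and only calls BEFORE/AFTER
        ;; during stack winding and unwinding
        (call-with-dynamic-guards before thunk after))
    (lambda results
      (after)                    ; normal exit: call the post-thunk ourselves
      (apply values results))))
```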

Error checking

After a while, you get sick of seeing segfaults. It’s much nicer if (car 0) signals an error instead of crashing the whole program. I added quite a few little type checks and assertions. While it’s still possible to crash Germ from user code, it’s much harder to do it by accident. On top of that, I implemented stack trace printing so that you can quickly locate the source of an error. Here’s an example:

> (define (bar)
    (car 0)
    #t) ; avoid tail position
=> bar
> (define (foo)
    (bar)
    #t) ; ditto
=> foo
> (foo)
Unhandled exception in procedure bar
    created in module user
  called by foo
    created in module user
car: Not a pair (16 0)

This makes it easy to solve little errors like a missing procedure or using the wrong number of operands.

Modules

The original Germ0 initialization script only did two things: load a script and launch a REPL. On Germ, it now bootstraps a module system and then calls the boot thunk defined in the (germ boot) module.

There are actually two module systems: a “boot” module system and a regular one. The difference between the two is that the “boot” module system uses the simple reader and expander from the initialization script, while the regular module system uses fully featured versions built using the “boot” module system.

Both of these module systems are like Guile’s. That is, they use define-module for creating modules and use-modules for importing them (as opposed to define-library and import like in R⁷RS). Not everything is implemented yet, but you can use #:select and #:prefix while importing and #:export, #:re-export, and #:pure while defining. Both exporting and re-exporting support renaming.
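To make the Guile flavour concrete, here is a small sketch (the module name (demo greet) is made up for illustration):

```scheme
;; Defining a module, selecting a single import and exporting a binding:
(define-module (demo greet)
  #:use-module ((ice-9 format) #:select (format))
  #:export (greet))

(define (greet name)
  (format #f "Hello, ~a!" name))

;; Elsewhere, importing it with a prefix:
;; (use-modules ((demo greet) #:prefix demo:))
;; (demo:greet "Germ")  ⇒ "Hello, Germ!"
```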

Unlike Mes, which copies Guile 1.8 code for the module system, I’ve opted to write it all from scratch. Guile’s module system is very complex, and supports fancy things like autoloading and observers. I don’t think Germ will have to support any of that (except maybe autoloading, which we can probably fake). So far it’s been easier having simple code that only provides what’s actually needed instead of dealing with the much more complicated code from Guile.

Macros

Again, looking at Germ0, we only had Lisp-style macros (define-macro) – and we didn’t even have quasiquote! Now I’ve ported PopSyntax to Germ, which provides hygienic syntax-case macros. PopSyntax uses the “sets of scopes” approach, developed by Matthew Flatt. After fixing a few ellipsis bugs – turns out that feature is complicated – it’s working really well. It can expand all kinds of complicated macros, like Alex Shinn’s match and the quasisyntax and quasiquote implementations from (old) psyntax (Portable syntax-case). The main problem is that it’s very slow. I’ll discuss performance in general at the end of this post.
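The classic hygiene test gives a sense of what that buys us: the macro’s temporary binding cannot capture a variable of the same name at the use site.

```scheme
(define-syntax swap!
  (syntax-rules ()
    ((_ a b)
     (let ((tmp a))   ; this tmp is hygienically distinct from any user tmp
       (set! a b)
       (set! b tmp)))))

(define tmp 1)
(define x 2)
(swap! tmp x)
;; tmp ⇒ 2, x ⇒ 1
```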

The main thunk

With all this in place, the boot thunk (which was developed using the simple reader and expander) can load the main thunk using a better reader and PopSyntax as an expander. The main thunk is normal (Guile-flavoured) Scheme code. It uses getopt-long to parse its command line arguments, and – don’t be too disappointed – can load scripts and launch REPLs!

At this point, though, the scripts are Guile-like scripts. And the REPLs are Guile-like REPLs. If I wanted to I could add support for “-L” for adding module directories or “-e” to run a particular procedure. It would all be straightforward. Yes, we have come full circle, but the scripts and REPLs can now be very complicated.

How complicated?

Well, at first I was excited and satisfied that Germ could run match. That’s a feature that’s as useful as it is taxing on the expander. But I kept going. It also runs getopt-long like I wrote above. It runs it completely unmodified from the current Guile sources. Did you know that getopt-long uses (ice-9 regex)? Guile uses the C library for regular expressions, which is not an option for Germ. However, I had already built a (ice-9 regex) wrapper around pregexp (Portable Regular Expressions for Scheme and Common Lisp) to add regular expressions to Mes. That also works unmodified on Germ.
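For anyone unfamiliar with it, here is roughly what running getopt-long looks like (the option grammar and arguments are made up for illustration; this is the standard (ice-9 getopt-long) interface):

```scheme
(use-modules (ice-9 getopt-long))

(define grammar
  '((verbose (single-char #\v) (value #f))    ; a flag
    (output  (single-char #\o) (value #t))))  ; an option taking a value

(define opts
  (getopt-long '("prog" "-v" "-o" "out.txt" "input.scm") grammar))

(option-ref opts 'verbose #f)  ; ⇒ #t
(option-ref opts 'output #f)   ; ⇒ "out.txt"
(option-ref opts '() '())      ; ⇒ ("input.scm"), the non-option arguments
```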

That’s not the end though. I kept poking, and was able to get NYACC to parse C code. That’s right! Germ will read

int
main (void)
{
  printf ("hello\n");
  return 0;
}

and then spit out

(trans-unit
 (fctn-defn (decl-spec-list
             (type-spec
              (fixed-type "int")))
     (ftn-declr
      (ident "main")
      (param-list
       (param-decl
        (decl-spec-list
         (type-spec (void))))))
   (compd-stmt
    (block-item-list
     (expr-stmt (fctn-call
                 (p-expr (ident "printf"))
                 (expr-list
                  (p-expr (string "hello\n")))))
     (return (p-expr (fixed "0")))))))

Hooray! I bet MesCC wouldn’t be too much extra work, which is exciting. I’m more interested in getting some of the build machinery from GNU Guix running, but it intimidates me. It makes rather liberal use of Guile features. Ultimately I want the early bootstrap packages in Guix to be able to use this machinery along with G-expressions without relying on a pre-built Guile.

Performance

There’s good news and bad news when it comes to performance. I’ll start with the bad news. Booting Germ – getting to the point where the main thunk is loaded – takes 16 seconds on my computer. Six. Teen. Seconds. This is really rough. There seem to be two major reasons for this. First, because Germ is written in assembly, implementing features in the kernel is hard, so most things are delegated to the Lisp layer. At the moment, everything that’s not required to be in the kernel isn’t. This means that Germ has to read and expand a lot of code in order to know how to do basic things. Ports, read, write, hash table lookups, records, and much more are all outside the kernel. Second, PopSyntax is much more “correct” than it is “fast”. This is partly a cost of the sets-of-scopes approach, and PopSyntax is not even a fast implementation of it. Basically, Germ has to expand a ton of code with a slow expander.

How about some good news? Sure, but first: a story. A fellow Mes hacker, Ekaitz Zárraga, wrote a blog post looking at the performance of Mes, Guile, and Chibi using the naïve Fibonacci function as a benchmark:

(define (fib n)
  (if (< n 2)
      n
      (+ (fib (- n 1))
         (fib (- n 2)))))

(write (fib 35))
(newline)

Computing (fib 35) results in calling fib 29 860 703 times! Unsurprisingly, Mes was slow, Guile was blazing fast, and Chibi was pretty darn good. Well, how did Germ do? Poorly! Germ took 4 times as long as Mes built with GCC and twice as long as Mes built with M2-Planet. This didn’t make any sense to me, since I know the internals of both Mes and Germ, and Germ should be faster….

Now the good news! The reason Germ is so slow is that its arithmetic procedures (+, -, and <) are not fully built into the kernel. The kernel exposes those operations, but Germ adds variadic wrappers over the binary primitives. That means that Mes is calling C code directly for arithmetic, while Germ is calling assembly code via a few layers of Lisp. If you adjust the benchmark so that Germ uses its primitives directly, it only takes 40% of the time that GCC Mes does! That’s still 8 times as long as Chibi, but I’m pleased that it’s at least theoretically faster than Mes.
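To illustrate the layering, here is a hypothetical sketch: suppose the kernel exposes a strictly binary primitive, here called %+ (the name is an assumption), and the Lisp layer builds the variadic + on top of it.

```scheme
(define (+ . args)
  ;; fold the binary kernel primitive over however many operands we got
  (let loop ((acc 0) (rest args))
    (if (null? rest)
        acc
        (loop (%+ acc (car rest)) (cdr rest)))))
```

Every (+ a b) in the benchmark then pays for list allocation and a Lisp-level loop just to reach the primitive.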

“Unexec”

What do you do with a Lisp system that’s slow to boot? Be like (ancient) Emacs and “unexec” it. I’m kidding, but only sort of. I’ve been experimenting with primitives that essentially serialize and deserialize Germ’s heap and its current continuation. They are called dump-image and load-image. After booting and before calling main, the boot thunk instead calls dump-image to write an image of the bootstrapped Germ to disk. Then, another script can run directly on the kernel to bootstrap just enough to call load-image. The load-image call is basically like calling a continuation that has been serialized to disk. Germ is immediately back at the point where it was about to call main. This works beautifully, bringing up a REPL in the blink of an eye.
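A hypothetical sketch of the workflow (the primitive names come from above, but the exact calling convention is my assumption):

```scheme
;; In the boot thunk, just before calling main:
(dump-image "germ.image")   ; serialize the heap and current continuation
(main)

;; In a separate, minimal script run directly on the kernel:
(load-image "germ.image")   ; resume as if dump-image had just returned
```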

I’m conflicted about this approach, since it feels like I’m sweeping serious performance problems under the rug. On the other hand, having images for other scripts could be really handy. You could have an image capturing the point right before you call the main procedure from MesCC or Gash or whatever. That way, you pay the cost of expanding macros only once. You could even go so far as to pre-compute the top-level lookups, embedding the locations directly in the code (but that may be a topic for a future post).

Other performance improvements

One issue that I suspect is slowing Germ down a lot is allocating every closure. As an example,

(let ((foo 42)) (+ foo 15))

expands to

((lambda (foo) (+ foo 15)) 42)

and that lambda generates a full closure that gets run and then immediately thrown away. This creates a lot of unnecessary GC pressure and just wastes time in general. It wouldn’t be too hard to recognize that special case, and speed it up. Since it’s used pervasively, I suspect it would improve things a bunch.

As explained above, wrapping performance-critical procedures like + makes everything crawl. There are two ways to improve this. First, I could add a bunch of more complicated primitives. This would certainly help, but it’s quite cumbersome. Programming in assembly is fine when you get a feel for it, but it sure keeps your ambition in check. Another option would be to implement case-lambda in the kernel. That way binary + would be the primitive, while all other kinds of + would dispatch to Lisp code. This would also help with map and its friends, whose fully variadic versions are much more complicated than their fixed-arity counterparts. It would also help with simple optional operands (like passing a port to write).
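A sketch of the case-lambda idea (again with an assumed binary kernel primitive %+): the common fixed arities go straight to the primitive, and only the general case pays for list handling.

```scheme
(define +
  (case-lambda
    (() 0)
    ((a) a)
    ((a b) (%+ a b))                   ; the hot path: one primitive call
    ((a b . rest) (apply + (%+ a b) rest))))
```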

As a final performance thought, I’m tempted to allow Lisp code to install new primitives into the kernel. Maybe I could reserve some fixed amount of space for new primitives, and rig up a way to write, compile, and install new primitives directly from Lisp. Or maybe write the primitives in C and use MesCC to compile and link them into the kernel in a new build. Or maybe Pre-Scheme…. This all seems a little crazy, but clearly has excellent “fun hack” value.

Future

First, I need to clean up these new features. Most of them are fine, but I left a few things in WIP status to keep momentum up. The big issue is that I maintain two versions of Germ (and Germ0): one written in GNU assembly (for the GNU assembler) and one written in Scheme assembly (for Germ itself). This is mainly because the GNU assembler can make a version of Germ with debugging symbols. It also helps make sure there are no Scheme assembler bugs (the two outputs should be bit-for-bit identical). After a while, I got bored of porting from GNU assembler to Scheme, so I left some stuff unported. It’s a rather mechanical task that I just need to chew through.

Second, I took a few shortcuts to get NYACC running, and will have to clean those up, too. Because boot times are so bad, I iterated (in classic Lisp fashion) by keeping an image running and monkey patching new features into Germ as much as I could. This means I have a little script that reaches into the guts of various modules and fixes little issues. I need to move all those fixes into the module source files.

After those “eat your vegetables” tasks are done, I can continue having fun getting MesCC to work. Ultimately it would make sense to rewrite portions of MesCC to use an assembler and linker written in Lisp. I don’t plan to worry about that anytime soon, though.

Of course, there will have to be some performance experiments, and the other two tasks from the first blog post still remain: porting Germ to other processor architectures and porting Germ to hex0.

The main goal I’m looking at now is to get Germ running as the bootstrap seed for Guix. I’ve already experimented with Germ0 assembling Germ0, and that worked. Getting the rest of the Germ sources into Guix and building and running it is the next step. After that, I want to experiment with G-expressions and then the build machinery. If that all works, we’ll really be cooking.

Summary

Germ continues to show promise both as a tool to clean up the early Guix bootstrap and as a viable Scheme interpreter for much of the rest of the bootstrap process (until Guile is available). We are starting to hit significant performance issues, which will have to be addressed. In theory, Germ should be fast enough, but a few little oversights are tanking its performance. It will be interesting to see if Germ can speed up C compilation, which has been a big issue for those working on bootstrapping. Overall I’ve been pleased with how stable Germ is (especially with the new error checking). It’s becoming a fun little system to hack on, which helps keep momentum (and motivation) up.