Germ as Bootstrap Guile

Germ can run the GNU build system from Guix

Introduction

Very recently, Germ learned another new trick. It can now replace Guile as the build-side Scheme interpreter for Guix when building the GNU Hello package. This is an exciting milestone, as the fear in my heart of trying to do this with Mes was what motivated Germ in the first place. Of course, Germ can’t do all of Mes’s tricks just yet, but this is the first thing that Germ can do that Mes cannot.

What it looks like

Here are some shell commands you can run in your own home to see the results for yourself:

git clone https://git.ngyro.com/germ/
cd germ
git checkout wip-guix-boot
guix shell -m manifest.scm
autoreconf -vif
./configure
make germ0-start germ0-boot.scm dist
gunzip -c germ-0.1.0.tar.gz  \
    | ./etc/make-ses.scm     \
    > germ-0.1.0.ses
guix build -f guix/boot.scm

If everything goes well, you should see something like:

successfully built /gnu/store/...-hello-2.12.3.drv
/gnu/store/...-hello-2.12.3

Of course whether this works depends on the version of Guix you’re using, but the build-side code rarely changes.

Let’s look at the packages of the “guix/boot.scm” file to see what’s going on.

Package 0: guile-via-germ0-seed

This is the bootstrap seed. It takes three inputs: the “germ0-start” binary, the “germ0-boot.scm” script, and an install script which is made using plain-file from Guix. The install script copies the binary and the bootstrap script into the store, and sets the shebang line of the bootstrap script to be the location of the binary. It calls this script “guile” so that we can use it for build-side Guix code. Note that this uses an opaque binary “blob”, but it’s very small and it’s the only one we need.

This package uses a custom “raw” build system. This is modelled after the bootstrap Guile package. It creates a derivation whose builder is just a single command. In this case, the command is like

germ0-start germ0-boot.scm \
    germ0-install.scm germ0-start germ0-boot.scm

but with each part replaced with a full store path.

Package 1: guile-via-germ0

This package uses the previous one as it’s Guile, and runs a G-expression via the “trivial” build system. This expression does two things:

  1. It unpacks the Germ sources, and
  2. It runs the “scm/germ0-to-germ0.scm” install script contained therein.

The first point is interesting, because the Germ sources are packaged as a Scheme program. To unpack them, you simply run the program! This is a nice way to avoid building a program for unpacking a tarball until later.

The output of this package is much the same as the first one, except that we built it ourselves. We are now “bootstrapped”.

Package 2: guile-via-germ-sans-image

This one is an intermediate step. It uses Germ0 to build regular Germ, but it cannot create an image because Germ0 has no way to execute another program. This is because the exec family of system calls requires a list argument, and I left support for that out of Germ0 to save space. Hence, we build the full Germ twice.

Package 3: guile-via-germ

This is the final Germ package. It is built by Germ, and improves on the previous package by including a pre-built image for fast startup. It will serve as the Scheme interpreter for all future packages.

Package 4: hello-germ

This is the main attraction. This is the hello package, but with Guile replaced by Germ (using the #:guile package argument). This requires Germ to run the gnu-build procedure from the GNU build system.

What it took

Loads! I’m looking at 128 commits now on the WIP branch I’m using for this! The changes fall into three broad categories: new features, bug fixes, and heinous hacks.

New features

Install scripts

Germ now has three install scripts (besides the standard GNU tools). One builds Germ0 from Germ0, the next builds Germ from Germ0, and the last builds Germ from Germ. In order for these to work, both Germ and Germ0 needed to learn how to create and change to directories, as well as set file permissions.

Furthermore, in order to run these scripts from the Guix trivial build system, Germ0 needed to be able to read environment variables.

Exceptions

Germ now has structured exception objects (with inheritance and everything). I’ve also added the two relevant SRFI’s: 34 (Exception Handling for Programs) and 35 (Conditions). On top of this, system calls now generally raise structured errors, which catch converts to old-style Guile system call errors.

scandir

Germ now has the scandir procedure that I wrote for Mes. It also has file-system-fold taken from Guile.

Random access, multi-mode ports

The ports system now supports random access for file ports. This is built on the lseek system call. (String ports should support this, too, but I didn’t get around to it.) Additionally, a single port can be an input and output port now. There’s some details about when ports get flushed that I may have borked, but I did think hard about it!

Bignums

I ran into the famous Y2K4 bug that all Unix users face when using 31-bit signed integers to represent time. So now Germ supports bignums (arbitrary precision integers). These bignums were bigdeals to implement. (I almost wrote a whole post about how division works, but I decided it’s actually pretty tedious.) This all happens in Scheme. Every arithmetic operation now has to be generic, and fixnum arithmetic has to be checked for overflow. I thought this would be a major performance problem, but so far I haven’t noticed when running regular code. The fibonacci benchmark (see my previous post) takes 3.3 times as long, but it’s artificially arithmetic heavy.

It took a long time to write the bignum code. Partly it was hard, but maybe more importantly it was boring. After figuring out how division worked the rest of it was mostly bookkeeping, but very careful bookkeeping. I wrote it with Guile and Guile-QuickCheck in hand, the latter making testing pretty painless. Briefly, I considered using code from the reference implementation of SRFI 77, but decided against it because I was concerned about both its license and its quality. However, I laughed at one of the comments:

; Divide m/n: find q and r such that m = q*n + r, where 0 <= r < n.
; Oh no... time to get out Knuth...

I did indeed get out Knuth, which is always a riot if you have a really thorny problem and a lot of patience for math and history.

POSIX procedures

Many more procedures for interacting with the operating system are now in Germ. Most of them deal with file system metadata, but my personal favourite is fork. I also implemented copy-file using Linux’s sendfile. That’s right: arithmetic is slow but you can copy files as fast as the kernel can sling bytes!

Bug fixes

As usual, many little bugs needed fixing, but I want to recount a bug fixing story, so I’ll only talk about one. It showed up after I implemented bignums. Every once in a while, fixnum? would throw a “wrong number of arguments” exception. The problem is that the source code clearly shows this is impossible. Furthermore, it would happen sporadically. If you change anything about the program, the error might go away or pop up at a completely different time. Sounds like a garbage collector bug... *shudders*

In order to get some more information, I wrapped the fixnum? procedure up so that I could catch the exception and print more details about what’s going on. Doing this showed that it only seemed to happen with the numbers 12 and 13. For a while it was always 13, and I was worried that Germ is refusing to work because it’s superstitious, but when 12 showed up I knew it had to be a real bug. Eventually I made it so that the wrapper was basically:

(define (wrapped-fixnum? obj)
  (or (false-if-exception (fixnum? obj))
      (fixnum? obj)))

That made the errors stop, but obviously it’s ridiculous.

The cause of the error is how fixnum? is defined. It checks that the object isn’t any other type. Germ only has four principle types, and fixnum? is one of them. To check if an object is a fixnum, I wrote a procedure that checks that it is not any of the other principle types:

(define (fixnum? obj)
  (and (not (pair? obj))
       (not (pointer? obj))
       (not (special? obj))))

Germ maintains a very small number of primitives, and so this seemed like a clever way to avoid making fixnum? one of them. Of the other three, both pair? and pointer? are primitives. The last one, special?, was defined in a goofy way:

(define (special? obj)
  (eq? obj (make-special obj)))

(Nerds will appreciate the fact that specials are simply the fixpoints of make-special.) The make-special procedure very unsafely tags the object it’s given as a special. I used it to generate special objects for procedures like eof-object?. However, there are a few objects that are very bad to have in user code. The worst of them is the broken heart object. This object tells the garbage collector that a value has already been moved, and that it should use the new location. If it shows up in user code, it steers the garbage collector into a ditch with a dangling pointer.

What becomes the broken heart object when passed into make-special? That’s right: the numbers 12 and 13! If you ask (fixnum? 13) it will in turn ask (special? 13), which will briefly generate a broken heart object. If you are very unlucky, the garbage collector will run into this value, get confused, and poison your cat.

In order to fix this bug, I replaced make-special with fixnum? (called number? during the Germ boot script) as a primitive. Now the reader has support for generating specials. They are merely literals in the source code. It’s a bit hacky, but I’m glad to be no longer haunted by those numbers.

Heinous hacks

So far Germ has been relatively free of shortcuts, but I’ve added some doozies now.

Fake system

We can’t easily support Guile’s system procedure, as it parses shell commands. Assuming we have no other shell or parser, we would need to pull in Gash for that job, which would be pretty messy for one little procedure. Instead, I’ve included a version of system that simply ignores the one unnecessary use that Guix has for it.

Fake division

Guix keeps track of how much time each “phase” of a build script takes, and presents this information to the user:

phase `build' succeeded after 1.3 seconds

The way it does this is to convert a time object—made up of seconds and nanoseconds—into an inexact number. Specifically, it runs

(+ (time-second delta)
   (/ (time-nanosecond delta) 1e9))

Now, Germ does not support inexact numbers. It’s not impossible to add them to the kernel or to emulate them, but I’m hoping that it won’t be necessary. In the mean time, I’ve added just enough hacks so that the above math works and the result can be consumed by format’s floating-point printer.

Now what?

In terms of achieving bootstrapping goals, the next step would be to try Germ on the full Guix bootstrap. Rather than just building the hello package, it would have to build everything up until Guile itself can be built. This is within reach, but it’s a pretty frustrating time (speaking from past experience). Even if that worked, we would still be using %bootstrap-guile for Gash and company. Fixing that involves porting those things to Germ. That should be doable, since I’ve already done much of that work for Mes. If those two parts worked, we could get rid of %bootstrap-guile (on x86, anyway).

Another question is performance. The fun side of that is speeding up the interpreter itself. Germ takes about 2.5× the time that Guile 1.8 does to compute (fib 35) fully recursively. Doing some experiments, I can get the C version of Germ close to parity (hooray computed gotos). That’s still an order of magnitude slower than modern Guile. It’s also 3–4× the time it takes Chibi to compute it. Ekaitz Zarraga has been working on implementing a compiler and bytecode interpreter for Mes, and honestly I’ve got a little FOMO. However, I don’t have good insight into what’s actually slowing Germ down outside of benchmarks. I could work on making procedure calls twice as fast, or I could work on making half as many procedures calls—it’s the same outcome! On that front I could speed up I/O by making system calls work directly on bytes in the heap, rather than going through an intermediary step. Or I could add let or even letrec to the kernel instead of emulating them with lambda (making temporary closures along the way). Plenty of experiments to try.

There’s also the need to run on other architectures. I’ve worked to keep the Scheme code portable, and test it from time to time by running it on an x86_64 kernel (built from C sources). It’s probably not perfect, but it gives me confidence that a port would require minimal Scheme changes. Porting the assembly code (and the assembler) would be a substantial effort. Honestly, it would be kinda relaxing—like doing a puzzle. The problem is that doing so would entrench the current design, and make changes very difficult. Until I’m sure about the design, I’m not interested in porting.

Parting words

Ending blog posts with a big fat list of jobs is a bit of a downer. This is supposed to be a celebration of a milestone: go forth and bootstrap your gexps! (Don’t actually Forth, though—LISP rules; Forth drools!)