Germ as a C Compiler

Germ can run MesCC.

Introduction

Germ can now compile basic C code using MesCC, and that’s a nice milestone to write a blog post about! I have a bunch to say about it, but I want to start by showing how you can try it for yourself. Note that Germ only runs on the Linux kernel and only on x86-compatible architectures (I’m running Guix on top of x86_64). Also, I assume you have access to the Guix package manager. If not, you will have to scrounge together the dependencies some other way.

Demo

For the demo, I will assume that we’re working out of a single directory. That is, the Mes and Germ checkouts should live in the same place. Note that building Mes is quite slow (dozens of minutes) and building the MesCC image with Germ is also quite slow (about 10 minutes).

Obtaining and building Mes

First, we need to check out and build Mes 0.27.1:

git clone -b v0.27.1 https://git.savannah.gnu.org/git/mes.git
cd mes
guix shell -D -f guix.scm
CC=i686-unknown-linux-gnu-gcc ./configure
make
exit
cd ..

We do this because we need the headers and libraries from its C library. Germ might be able to build everything itself, but I haven’t tried that yet. For now, we will simply link with the Mes-built libraries.

Getting the MesCC demo version of Germ

I’ve prepared a special branch for this demo called demo-mescc (which will only be available temporarily – sorry, readers from the distant future). This provides the NYACC and MesCC modules, as well as changes to them and Germ to make things “just work”. Check out this branch and build Germ:

git clone -b demo-mescc https://git.ngyro.com/germ/
cd germ
guix shell -m manifest.scm
autoreconf -vif
./configure
make
exit

Building a MesCC image

Germ supports serializing a program’s state to disk, and this is used as a way to “compile” programs. The demo branch includes a script you can use to make a MesCC executable:

./pre-inst-germ mescc-image.scm "$(pwd)"/germ-continue pre-inst-mescc
chmod +x pre-inst-mescc

Our test subject

We now need a C program to compile. We don’t dare challenge the status quo:

#include <stdio.h>

int
main ()
{
  puts ("Hello world!");
  return 0;
}

Save this as “hello.c”.

Invoking the compiler

MesCC relies on an assembler called “M1” and a linker called “hex2”. Get them using Guix:

guix shell -e '(@@ (gnu packages commencement) stage0-posix)'

Finally, we can run MesCC using Germ:

./pre-inst-mescc --arch=x86 -I ../mes/include \
    -L ../mes/lib -L ../mes/mescc-lib -o hello hello.c

Et voilà:

./hello
Hello world!

What’s new

Since getting NYACC working, there have been 34 new commits and a fair bit of cleanup of the branch history. Let’s look at some highlights.

Clean up

All the shortcuts from the last round of hacking are gone. Each change at the kernel level has been copied to the Scheme version as well as the C version. I’ve been using the C version of the kernel to test the portability of the growing base of Scheme modules shipped with Germ. I can build a working x86_64 version of the kernel, and use that to make sure all the system calls made from Scheme code use the correct IDs and word size depending on the architecture. This reassures me that an ARM or RISC-V port would be possible and would not be full of unpleasant surprises.
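To make the architecture dependence concrete, here is a minimal sketch in C of the kind of table involved. It is not Germ's actual code: the names `syscall_table`, `syscall_id`, and `word_size` are hypothetical, and only the numbers themselves (from the Linux x86 and x86_64 ABIs) are facts.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names throughout; this is an illustration, not Germ's table. */
enum arch { ARCH_X86, ARCH_X86_64 };

struct syscall_entry
{
  const char *name;
  long id_x86;     /* number under the 32-bit x86 ABI */
  long id_x86_64;  /* number under the 64-bit x86_64 ABI */
};

/* The same Linux system call has a different number on each ABI. */
static const struct syscall_entry syscall_table[] = {
  { "exit",    1,  60 },
  { "fork",    2,  57 },
  { "write",   4,   1 },
  { "execve", 11,  59 },
  { "wait4", 114,  61 },
};

/* Look up the number of the system call NAME for ARCH, or -1 if unknown. */
long
syscall_id (enum arch arch, const char *name)
{
  assert (name != NULL);
  size_t count = sizeof syscall_table / sizeof syscall_table[0];
  for (size_t i = 0; i < count; i++)
    if (strcmp (syscall_table[i].name, name) == 0)
      return arch == ARCH_X86 ? syscall_table[i].id_x86
                              : syscall_table[i].id_x86_64;
  return -1;
}

/* Size in bytes of a machine word (and of a pointer argument) for ARCH. */
size_t
word_size (enum arch arch)
{
  return arch == ARCH_X86 ? 4 : 8;
}
```

Code that hard-codes `4` for write or assumes 4-byte pointers works on x86 and quietly breaks on x86_64, which is exactly the kind of mistake a second kernel build flushes out.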

On top of that, I’m now rather confident in the kernel changes and the module system, so I’ve merged the WIP branches into the main branch.

Image loading

My original architecture for image loading was to use the same kernel utility and have it run specialized boot code to load an image. Now, there’s a specialized utility that loads an image directly, which means that loading an image takes only a few milliseconds. As a result, there are two utilities for each implementation of Germ: one called “start” and the other called “continue”. The “start” utility reads source code from scratch, while the “continue” utility loads a pre-built image. I’ve also taken to using the “continue” utility to add a shebang line to images, so that they can be executed directly. Now, the final germ utility is just an image that is about to process command line arguments and either run a script or launch a REPL as requested.

Lists in system calls

MesCC calls out to other utilities using the system* procedure. Under the hood, this uses the fork, exec, and wait system calls. The exec system call is a little peculiar in that you have to pass it lists: one for the arguments and one for the environment. To the Linux kernel, a list of strings is a NULL-terminated sequence of pointers to NUL-terminated sequences of bytes.
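For readers who haven't seen these lists up close, here is what the fork, exec, and wait sequence looks like from C. This is a sketch of the general pattern rather than of Germ's implementation: `run_and_wait` and `run_shell_command` are hypothetical helpers, and the `/bin/sh` path and `PATH` value are assumptions made for the sake of a small example.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run PROGRAM and wait for it to finish.  ARGV and ENVP are the two
   "lists": NULL-terminated arrays of pointers to NUL-terminated strings.
   Returns the child's exit status, or -1 on failure. */
int
run_and_wait (const char *program, char *const argv[], char *const envp[])
{
  assert (program != NULL && argv != NULL && envp != NULL);

  pid_t pid = fork ();
  if (pid < 0)
    return -1;

  if (pid == 0)
    {
      /* Child: replace this process with PROGRAM. */
      execve (program, argv, envp);
      _exit (127);  /* only reached if execve itself failed */
    }

  /* Parent: wait for the child and decode its status. */
  int status;
  if (waitpid (pid, &status, 0) < 0)
    return -1;
  return WIFEXITED (status) ? WEXITSTATUS (status) : -1;
}

/* Hypothetical convenience wrapper; the shell path is an assumption. */
int
run_shell_command (const char *command)
{
  assert (command != NULL);
  char *argv[] = { "sh", "-c", (char *) command, NULL };
  char *envp[] = { "PATH=/bin:/usr/bin", NULL };
  return run_and_wait ("/bin/sh", argv, envp);
}
```

In C the compiler lays out `argv` and `envp` for you. From Scheme, all of those pointers and bytes have to be placed by hand.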

To support this style of system call, the syscall primitive now accepts “l” as a parameter specification. This works basically like the old “p” specification, except that it adjusts a NULL-terminated list of pointers already in the system call buffer.

Using this, and a lot of supporting Scheme code, we can make exec system calls from Scheme code. The tricky part from the Scheme side is loading (and aligning) the strings correctly in the system call buffer. That buffer also had to be made larger to support storing an entire environment (full of store paths because I’m running Guix). I’ve considered using bytevectors for this in the past, but so far don’t see an urgent need for it.
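One way to picture the two halves of this job is the C sketch below. It rests on an assumption the text above does not spell out: that “adjusting” means turning buffer-relative offsets into absolute addresses. All of the names (`new_syscall_buffer`, `pack_string_list`, `adjust_string_list`) and the layout (pointer table first, string bytes after) are hypothetical, chosen only to illustrate the idea.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the system call buffer; malloc returns aligned memory. */
unsigned char *
new_syscall_buffer (size_t size)
{
  return malloc (size);
}

/* Step 1 (the "Scheme side" in this sketch): write COUNT offsets and a
   zero terminator at the start of BUF, then the string bytes after them.
   Offsets are relative to BUF.  Returns bytes used, or 0 if BUF is too
   small. */
size_t
pack_string_list (unsigned char *buf, size_t size,
                  const char *const *strings, size_t count)
{
  assert (buf != NULL && strings != NULL);
  /* The table holds word-sized slots, so BUF must be word-aligned. */
  assert ((uintptr_t) buf % sizeof (uintptr_t) == 0);

  size_t used = (count + 1) * sizeof (uintptr_t);
  if (used > size)
    return 0;

  uintptr_t *table = (uintptr_t *) buf;
  for (size_t i = 0; i < count; i++)
    {
      size_t len = strlen (strings[i]) + 1;  /* include the NUL byte */
      if (len > size - used)
        return 0;
      memcpy (buf + used, strings[i], len);
      table[i] = used;  /* never zero: strings start after the table */
      used += len;
    }
  table[count] = 0;
  return used;
}

/* Step 2 (the "primitive side" in this sketch): walk the table up to its
   zero terminator, replacing each offset with an absolute pointer. */
char **
adjust_string_list (unsigned char *buf)
{
  assert (buf != NULL);
  uintptr_t *table = (uintptr_t *) buf;
  char **list = (char **) buf;
  size_t i;
  for (i = 0; table[i] != 0; i++)
    {
      uintptr_t offset = table[i];
      list[i] = (char *) buf + offset;
    }
  list[i] = NULL;
  return list;
}
```

The result of `adjust_string_list` has the shape execve expects for its argument and environment lists. Putting the table first keeps zero free to serve as the terminator, since no string can live at offset zero.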

Misc. Scheme improvements

There’s a never-ending list of Guile features that are slowly creeping into Germ. Trying to be brief, here are a few of the most interesting ones:

It’s getting much easier to add features now, too. New code can make use of match. Existing code can be copied from Guile and expected to work without modification. Having backtraces and near-instant startup makes testing new code almost fun! (Most things are fun compared to debugging an unadorned “Segmentation fault” that takes a minute to reproduce.)

A bit of a snag

One of the biggest problems I encountered while doing this is that NYACC relies on the evaluation order of operands in a few places. As an example, it has code like:

(define (do-stuff first-char second-char)
  (frobnicate! first-char second-char))

(do-stuff (read-char) (read-char))

Now, if you process the operands from left to right, this code makes perfect sense. However, Scheme does not specify the order in which the operands are evaluated, and portable code should not depend on it. I was hoping to take advantage of that when I simplified things by evaluating the operands from right to left. That means that when running on Germ, the first character read would be called second-char while the second would be called first-char. This cropped up four times in the NYACC code exercised to parse “hello.c”.

So far, I’ve just patched NYACC. It wouldn’t be too hard to change Germ, but maybe it would be better if it caught this kind of thing. It’s something I’ll have to ponder.
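C has the same rule about function arguments, so the shape of the problem and of the fix can be shown there too. This is an analogy only, not the actual NYACC patch, and every name in it (`struct reader`, `next_char`, `read_pair_unspecified`, `read_pair_sequenced`) is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* A tiny character source standing in for an input port. */
struct reader
{
  const char *text;
  size_t pos;
};

/* Return the next character and advance, or -1 at the end of the text. */
int
next_char (struct reader *r)
{
  assert (r != NULL && r->text != NULL);
  unsigned char c = (unsigned char) r->text[r->pos];
  if (c == '\0')
    return -1;
  r->pos++;
  return c;
}

struct pair
{
  int first;
  int second;
};

static struct pair
make_pair (int first, int second)
{
  struct pair p = { first, second };
  return p;
}

/* Order-dependent: the language lets either call to next_char run first,
   so "first" may end up holding the second character. */
struct pair
read_pair_unspecified (struct reader *r)
{
  return make_pair (next_char (r), next_char (r));
}

/* Portable: each read is its own statement, so the order is fixed. */
struct pair
read_pair_sequenced (struct reader *r)
{
  int first = next_char (r);
  int second = next_char (r);
  return make_pair (first, second);
}
```

In Scheme the equivalent fix is to bind the reads one after another (with `let*`, for example) before making the call.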

Looking ahead

I’ve already started a wip-guix-build branch to tackle running Scheme code from Guix’s “build” modules. With some beefing up of its reader and expander, Germ can load the gnu-build-system module (along with its dependencies). This is nice because there are a few gnarly macros in there, and Germ seems to be able to chew through them just fine (famous last words – I’m sure it will fall over as soon as it tries to do anything with them). The goal here is for Germ to build the commencement Guix packages (i.e., Guile on the host side, Germ on the build side). I guess a good start would be the stage0-posix package, which uses the trivial-build-system and a handful of procedures from (guix build utils). From there, I could move on to mes-boot to try the gnu-build-system.

The other half of that job is getting Germ built from scratch in Guix. I have a package that builds germ0, but I still need to use that to build the rest of Germ with all of its modules. This will likely use the same self-extracting Scheme trick I used for Bootar (here’s an example of it). It’s basically just a simplified tarball rendered as an S-expression with enough Scheme code to write the files to disk.

I think I’ll leave the future work at that. There’s a bunch to do with MesCC and Gash, but the Guix build code is a big enough job for now.