Preservation of Guix Schema

Overview

The database consists of two primary tables, “fods” (fixed-output derivations) and “commits”. They have a many-to-many relationship managed by the table “fod_commit_links”. The table “fod_references” contains some more information about each fixed-output derivation (described below). Finally, the table “failed_fetches” is a log for when the “identify” procedure cannot fetch the source of a fixed-output derivation.

Graphical
          representation of table relationships.  The table “fods” has
          a many-to-many relationship to “commits”, and a one-to-many
          relationship to both “fod_references” and “failed_fetches”.
Figure 1: Database table relationships.

Example Queries

Show all missing Git references

This query lists all the Git references that are missing from the Software Heritage Archive.

SELECT f.algorithm, f.hash, fr.reference
FROM fods f
    LEFT JOIN fod_references fr USING (fod_id)
WHERE NOT f.is_in_swh
    AND fr.reference_type = 'git'

Show all unknown, non-recursive Git references

This query lists all the Git references that could not be processed (because they were unavailable during processing or because of a bug in the processing script). This query excludes recursive Git references, as they cannot be processed yet.

SELECT f.algorithm, f.hash, fr.reference
FROM fods f
    LEFT JOIN fod_references fr USING (fod_id)
WHERE f.is_in_swh IS NULL
    AND fr.reference_type = 'git'
    AND fr.reference NOT LIKE '%(recursive? #t)%'

Show all missing “.tar.gz” references

This query lists all the “.tar.gz” references that are missing from the Software Heritage Archive.

SELECT f.algorithm, f.hash, fr.reference
FROM fods f
    LEFT JOIN fod_references fr USING (fod_id)
WHERE NOT f.is_in_swh
    AND fr.reference_type = 'tar-gz'

Show all unknown “.tar.xz” references

This query lists all the “.tar.xz” references that could not be processed (because they were unavailable during processing or because of a bug in the processing script).

SELECT f.algorithm, f.hash, fr.reference
FROM fods f
    LEFT JOIN fod_references fr USING (fod_id)
WHERE f.is_in_swh IS NULL
    AND fr.reference_type = 'tar-xz'

Tables in Detail

fods

The rows of this table represent fixed-output derivations from the Guix package graph.

fod_id

INTEGER PRIMARY KEY

A unique integer identifying the fixed-output derivation.

algorithm

TEXT NOT NULL

The algorithm used to compute the hash (e.g., “sha256”).

hash

TEXT NOT NULL

The hash of the source.

swhid

TEXT

The Software Heritage ID (SWHID) of the source. Some fixed-output derivations corresponds to multiple SWHIDs. In that case, this column will contain 'composite', and the SWHIDs will be stored in the “subfods” table.

is_in_swh

INTEGER (Boolean)

Whether or not the source is in the Software Heritage archive. If this is FALSE, then the source could not be found in the archive. If this is NULL, then it has not been checked.

is_in_disarchive

INTEGER (Boolean)

Reserved for future use. 🙂

derivation

TEXT NOT NULL

The serialized derivation à la write-derivation.

commits

The rows of this table represent commits in the Guix Git repository.

commit_id

INTEGER PRIMARY KEY

A unique integer identifying the commit.

hash

TEXT NOT NULL UNIQUE

The commit ID (hash) from Git.

push_time

INTEGER (Unix timestamp)

The time the commit was pushed. This comes from the Guix Data Service, which in turn gets it from the guix-commits mailing list.

fod_commit_links

A simple table linking fixed-output derivations to commits. Every record is unique (the primary key is both columns).

fod_id

INTEGER NOT NULL REFERENCES fods (fod_id)

A pointer to a fixed-output derivation.

commit_id

INTEGER NOT NULL REFERENCES commits (commit_id)

A pointer to a commit.

subfods

This table stores is used to attach multiple SWHIDs to a single fixed-output derivation. This is used, for example, to represent composing multiple directories together from a single Subversion repository.

fod_id

INTEGER NOT NULL REFERENCES fods (fod_id)

A pointer to the parent fixed-output derivation.

ordinal

INTEGER NOT NULL

The position of this component in the list.

location

TEXT NOT NULL

The location to put this component (as a file or directory name).

swhid

TEXT NOT NULL

The SWHID of this component.

is_in_swh

INTEGER (Boolean)

Whether or not this component is in the Software Heritage archive. If this is FALSE, then the component could not be found in the archive. If this is NULL, then it has not been checked.

fod_references

A simple table attaching “references” to fixed-output derivations. Every record is unique (the primary key is both columns).

fod_id

INTEGER NOT NULL REFERENCES fods (fod_id)

A pointer to a fixed-output derivation.

reference_type

TEXT

The type of the reference. See the schema below for possible values. Note that this column may be NULL, which means that the type is unknown.

is_error

BOOLEAN

A flag indicating that a reference is probably a mistake and should not be used.

reference

TEXT NOT NULL

A serialized form of the “uri” field of the origin record that corresponds to the fixed-output derivation. For a plain URL, this will be a serialized string: "https://...". For a Git reference, it will look like:

(git-reference
  (url "...")
  (commit "...")
  (recursive? #f))

It will be similar for other kinds of references (Subversion, CVS, etc.). Oftentimes the reference will have more information and be easier to use than the serialized derivation, which makes it extremely useful.

failed_fetches

This table records failed attempts at identifying (finding the SWHID) of fixed-output derivations. It’s main purpose is to prevent trying the same fixed-output derivation over and over. Every record is unique (the primary key is both columns).

fod_id

INTEGER NOT NULL REFERENCES fods (fod_id)

A pointer to a fixed-output derivation.

fetch_time

INTEGER (Unix timestamp)

The time that the identification attempt failed.

failure_type

TEXT

The type of the failure. See the schema below for possible values.

Raw Schema

CREATE TABLE commits
  (commit_id INTEGER PRIMARY KEY,
   hash TEXT NOT NULL UNIQUE,
   push_time INTEGER);

CREATE TABLE fods
  (fod_id INTEGER PRIMARY KEY,
   algorithm TEXT NOT NULL,
   hash TEXT NOT NULL,
   swhid TEXT,
   is_in_swh INTEGER,
   is_in_disarchive INTEGER,
   derivation TEXT NOT NULL,
   UNIQUE (hash, algorithm));

CREATE TABLE fod_commit_links
  (fod_id INTEGER NOT NULL REFERENCES fods (fod_id),
   commit_id INTEGER NOT NULL REFERENCES commits (commit_id),
   PRIMARY KEY (fod_id, commit_id));

CREATE INDEX fod_commit_links_reverse_index ON fod_commit_links
  (commit_id);

CREATE TABLE subfods
  (fod_id INTEGER NOT NULL REFERENCES fods (fod_id),
   ordinal INTEGER NOT NULL,
   location TEXT NOT NULL,
   swhid TEXT NOT NULL,
   is_in_swh INTEGER,
   UNIQUE (fod_id, ordinal));

CREATE TABLE reference_types
  (reference_type TEXT NOT NULL PRIMARY KEY);

INSERT INTO reference_types
VALUES
  ('git'),
  ('svn'),
  ('hg'),
  ('cvs'),
  ('bzr'),
  ('tar-gz'),
  ('tar-xz'),
  ('tar-bz2'),
  ('tar'),
  ('zip'),
  ('text');

CREATE TABLE fod_references
  (fod_id INTEGER NOT NULL REFERENCES fods (fod_id),
   reference_type TEXT REFERENCES reference_types (reference_type),
   -- The 'is_error' flag is set for rows that are clearly a mistake.
   -- Sometimes, if the reference is wrong but the file is already in
   -- the Guix store, a committer won't notice that the reference is
   -- bogus.  This flag lets us ignore those references.
   is_error BOOLEAN NOT NULL DEFAULT FALSE,
   reference TEXT NOT NULL,
   PRIMARY KEY (fod_id, reference));

CREATE TABLE failure_types
  (failure_type TEXT NOT NULL PRIMARY KEY);

INSERT INTO failure_types
VALUES
  ('bail'),
  ('verify'),
  ('fetch'),
  ('disarchive');

CREATE TABLE failed_fetches
  (fod_id INTEGER NOT NULL,
   reference TEXT NOT NULL,
   fetch_time INTEGER NOT NULL,
   failure_type TEXT REFERENCES failure_types (failure_type),
   PRIMARY KEY (fod_id, reference)
   FOREIGN KEY (fod_id, reference) REFERENCES fod_references (fod_id, reference));