The database consists of two primary tables, “fods” (fixed-output derivations) and “commits”. They have a many-to-many relationship managed by the table “fod_commit_links”. The table “fod_references” contains some more information about each fixed-output derivation (described below). Finally, the table “failed_fetches” is a log for when the “identify” procedure cannot fetch the source of a fixed-output derivation.
This query lists all the Git references that are missing from the Software Heritage Archive.
SELECT f.algorithm, f.hash, fr.reference FROM fods f LEFT JOIN fod_references fr USING (fod_id) WHERE NOT f.is_in_swh AND fr.reference_type = 'git'
This query lists all the Git references that could not be processed (because they were unavailable during processing or because of a bug in the processing script). This query excludes recursive Git references, as they cannot be processed yet.
SELECT f.algorithm, f.hash, fr.reference FROM fods f LEFT JOIN fod_references fr USING (fod_id) WHERE f.is_in_swh IS NULL AND fr.reference_type = 'git' AND fr.reference NOT LIKE '%(recursive? #t)%'
This query lists all the “.tar.gz” references that are missing from the Software Heritage Archive.
SELECT f.algorithm, f.hash, fr.reference FROM fods f LEFT JOIN fod_references fr USING (fod_id) WHERE NOT f.is_in_swh AND fr.reference_type = 'tar-gz'
This query lists all the “.tar.xz” references that could not be processed (because they were unavailable during processing or because of a bug in the processing script).
SELECT f.algorithm, f.hash, fr.reference FROM fods f LEFT JOIN fod_references fr USING (fod_id) WHERE f.is_in_swh IS NULL AND fr.reference_type = 'tar-xz'
The rows of this table represent fixed-output derivations from the Guix package graph.
INTEGER PRIMARY KEY
A unique integer identifying the fixed-output derivation.
TEXT NOT NULL
The algorithm used to compute the hash (e.g., “sha256”).
TEXT NOT NULL
The hash of the source.
TEXT
The Software Heritage ID (SWHID) of the source.
INTEGER
(Boolean)
Whether or not the source is in the Software Heritage
archive. If this is FALSE
, then the source
could not be found in the archive. If this
is NULL
, then it has not been checked.
INTEGER
(Boolean)
Reserved for future use. 🙂
TEXT NOT NULL
The serialized derivation à la write-derivation
.
The rows of this table represent commits in the Guix Git repository.
INTEGER PRIMARY KEY
A unique integer identifying the commit.
TEXT NOT NULL UNIQUE
The commit ID (hash) from Git.
INTEGER
(Unix timestamp)
The time the commit was pushed. This comes from the Guix Data Service, which in turn gets it from the guix-commits mailing list.
A simple table linking fixed-output derivations to commits. Every record is unique (the primary key is both columns).
INTEGER NOT NULL REFERENCES fods (fod_id)
A pointer to a fixed-output derivation.
INTEGER NOT NULL REFERENCES commits (commit_id)
A pointer to a commit.
A simple table attaching “references” to fixed-output derivations. Every record is unique (the primary key is both columns).
INTEGER NOT NULL REFERENCES fods (fod_id)
A pointer to a fixed-output derivation.
TEXT
The type of the reference. See the schema below for
possible values. Note that this column may
be NULL
, which means that the type is
unknown.
BOOLEAN
A flag indicating that a reference is probably a mistake and should not be used.
TEXT NOT NULL
A serialized form of the “uri” field of the origin
record that corresponds to the fixed-output derivation.
For a plain URL, this will be a serialized
string: "https://..."
. For a Git reference,
it will look like:
(git-reference (url "...") (commit "...") (recursive? #f))
It will be similar for other kinds of references (Subversion, CVS, etc.). Oftentimes the reference will have more information and be easier to use than the serialized derivation, which makes it extremely useful.
This table records failed attempts at identifying (finding the SWHID) of fixed-output derivations. It’s main purpose is to prevent trying the same fixed-output derivation over and over. Every record is unique (the primary key is both columns).
INTEGER NOT NULL REFERENCES fods (fod_id)
A pointer to a fixed-output derivation.
INTEGER
(Unix timestamp)
The time that the identification attempt failed.
TEXT
The type of the failure. See the schema below for possible values.
CREATE TABLE commits (commit_id INTEGER PRIMARY KEY, hash TEXT NOT NULL UNIQUE, push_time INTEGER); CREATE TABLE fods (fod_id INTEGER PRIMARY KEY, algorithm TEXT NOT NULL, hash TEXT NOT NULL, swhid TEXT, is_in_swh INTEGER, is_in_disarchive INTEGER, derivation TEXT NOT NULL, UNIQUE (hash, algorithm)); CREATE TABLE fod_commit_links (fod_id INTEGER NOT NULL REFERENCES fods (fod_id), commit_id INTEGER NOT NULL REFERENCES commits (commit_id), PRIMARY KEY (fod_id, commit_id)); CREATE INDEX fod_commit_links_reverse_index ON fod_commit_links (commit_id); CREATE TABLE reference_types (reference_type TEXT NOT NULL PRIMARY KEY); INSERT INTO reference_types VALUES ('git'), ('svn'), ('hg'), ('cvs'), ('bzr'), ('tar-gz'), ('tar-xz'), ('tar-bz2'), ('tar'), ('zip'), ('text'); CREATE TABLE fod_references (fod_id INTEGER NOT NULL REFERENCES fods (fod_id), reference_type TEXT REFERENCES reference_types (reference_type), -- The 'is_error' flag is set for rows that are clearly a mistake. -- Sometimes, if the reference is wrong but the file is already in -- the Guix store, a committer won't notice that the reference is -- bogus. This flag lets us ignore those references. is_error BOOLEAN NOT NULL DEFAULT FALSE, reference TEXT NOT NULL, PRIMARY KEY (fod_id, reference)); CREATE TABLE failure_types (failure_type TEXT NOT NULL PRIMARY KEY); INSERT INTO failure_types VALUES ('bail'), ('verify'), ('fetch'), ('disarchive'); CREATE TABLE failed_fetches (fod_id INTEGER NOT NULL, reference TEXT NOT NULL, fetch_time INTEGER NOT NULL, failure_type TEXT REFERENCES failure_types (failure_type), PRIMARY KEY (fod_id, reference) FOREIGN KEY (fod_id, reference) REFERENCES fod_references (fod_id, reference));