[WIP] Prototype database rework
This PR implements the changes outlined in #183, restructuring the database schema to better support multiple distributions, architectures, and releases in a single database. The current code is in a prototype stage, but functions externally the same as rebuilderd does today.
Care has been taken to preserve the behaviour of the current API on the /v0 endpoint, but fully utilizing the new database schema will require API changes as well. I will add a prototype implementation of that on a /v1 endpoint for review and discussion.
The changes are significant across most of the codebase, though the overall flow of the program remains the same. The most obvious changes are more complex SQL queries with various joins, a greater reliance on targeted SQL queries in the various endpoints (i.e., not loading more data than we need), and the use of new distro-agnostic naming conventions.
My primary goal with opening this PR now is to get feedback and eyes on the implementation so I'm not going in the completely wrong direction - any and all thoughts are appreciated.
I've now added a proposal for a v1 REST API - please take a look at that as well. https://github.com/kpcyrd/rebuilderd/blob/55d7b8564882f6b75c4cf9fa28d014a72609dd0f/contrib/docs/rebuilderd-v1.yml
As of today, this PR does what it says on the tin!
I have tested most operations locally and everything appears to work perfectly fine, so I'm calling this good for initial review. I'll try and hold off on any further major changes until @kpcyrd and some community members take a look at the fundamentals so we can course correct if anything is off.
Please be aware that even though the PR is ready for initial review, it is by no means finished. There's a lot of new code and a significant number of touches all across the codebase with a completely new database schema and REST API surface. There are removed parts that probably need to be reintroduced and loads of TODOs sprinkled here and there.
hey, can we pick this up? trixie will be released soon and I'd like to test forky and unstable as well...! :)
I could set up a test system if this would help, but as I understand the comment from June 16, the code is not there yet.
@h01ger A test system would be great! The code should actually run fine (and will migrate from an existing database, but will need some manual data fixups).
In my own tests, the services run, the API works, and the database ingests packages as expected. A review of the design as a whole is needed by @kpcyrd but it should absolutely be functional as it stands right now.
On Tue, Aug 05, 2025 at 06:33:51AM -0700, Jarl Gullberg wrote:
> @h01ger A test system would be great! The code should actually run fine (and will migrate from an existing database, but will need some manual data fixups).
hm, ok. so the migration code is buggy or lacks some features?
> In my own tests, the services run, the API works, and the database ingests packages as expected. A review of the design as a whole is needed by @kpcyrd but it should absolutely be functional as it stands right now.
alright! will see how+when I can do a setup! thanks!
-- cheers, Holger
⢀⣴⠾⠻⢶⣦⠀ ⣾⠁⢠⠒⠀⣿⡁ holger@(debian|reproducible-builds|layer-acht).org ⢿⡄⠘⠷⠚⠋⠀ OpenPGP: B8BF54137B09D35CF026FE9D 091AB856069AAA1C ⠈⠳⣄
"tja" - a German reaction to the apocalypse, dawn of the gods, nuclear war, an alien attack or no bread in the house.
@h01ger The primary thing is that a new column was introduced in the "identity" of packages - a migrated database will run just fine, but if you want to merge any databases you need to populate that column with which release a package belongs to (trixie, forky, bookworm, rawhide, etc). It's this one, specifically: https://github.com/kpcyrd/rebuilderd/pull/184/files#diff-dc42c6b4e09bb29e710ec734880b8f48645ce690d96e014741e40868479d7159R67 Pretty easy to do with some SQL as a one-time data fixup, but it needs information that simply isn't in the database today.
Give me a ping if you have any questions or run into any trouble!
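For illustration, the one-time fixup described above could look roughly like the sketch below. It assumes the new column is `"release"` on a `source_packages` table (as in the unique index quoted later in this thread); the helper name `populate_release`, the simplified table layout, and the sample rows are hypothetical and only there to make the example runnable against an in-memory database.

```python
import sqlite3


def populate_release(conn: sqlite3.Connection, release: str) -> int:
    """Set the release on every row that has none yet; return the number of rows changed."""
    cur = conn.execute(
        'UPDATE source_packages SET "release" = ? WHERE "release" IS NULL',
        (release,),
    )
    conn.commit()
    return cur.rowcount


def demo() -> list:
    # Simplified, assumed table layout; the real migrated schema has more to it.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE source_packages ("
        "id INTEGER PRIMARY KEY, name TEXT NOT NULL, version TEXT NOT NULL, "
        'distribution TEXT NOT NULL, "release" TEXT, component TEXT)'
    )
    # Sample rows standing in for a freshly migrated single-release database.
    conn.executemany(
        "INSERT INTO source_packages(name, version, distribution, component) "
        "VALUES (?, ?, ?, ?)",
        [
            ("389-ds-base", "3.1.2+dfsg1-1", "debian", "main"),
            ("hello", "2.10-3", "debian", "main"),
        ],
    )
    changed = populate_release(conn, "trixie")
    rows = conn.execute(
        'SELECT name, "release" FROM source_packages ORDER BY name'
    ).fetchall()
    return [changed, rows]
```

Running something like this once per source database, with that database's release name, before merging would give every row a distinct release value. Take a backup first; the sketch does not touch any other table.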
On Tue, Aug 05, 2025 at 07:56:49AM -0700, Jarl Gullberg wrote:
> @h01ger The primary thing is that a new column was introduced in the "identity" of packages - a migrated database will run just fine, but if you want to merge any databases you need to populate that column with which release a package belongs to (trixie, forky, bookworm, rawhide, etc). It's this one, specifically: https://github.com/kpcyrd/rebuilderd/pull/184/files#diff-dc42c6b4e09bb29e710ec734880b8f48645ce690d96e014741e40868479d7159R67 Pretty easy to do with some SQL as a one-time data fixup, but it needs information that simply isn't in the database today.
so if I migrate one db to this new setup/branch, things will just work, however adding my other databases with other archs will need some minimal sql query?
-- cheers, Holger
⢀⣴⠾⠻⢶⣦⠀ ⣾⠁⢠⠒⠀⣿⡁ holger@(debian|reproducible-builds|layer-acht).org ⢿⡄⠘⠷⠚⠋⠀ OpenPGP: B8BF54137B09D35CF026FE9D 091AB856069AAA1C ⠈⠳⣄
You can cut all the flowers, but you can’t keep spring from coming. #transdayofremembrance #berlin (2024)
That's correct :)
hm, what conflicts is GitHub complaining about now?
It's just a package dependency file - nothing blocking.
I could successfully build this on trixie and just got one warning:
Compiling rebuildctl v0.23.1 (/home/holger/rebuilderd/tools)
warning: unused imports: `QueuedJob` and `RebuildReport`
--> worker/src/rebuild.rs:9:21
|
9 | ArtifactStatus, QueuedJob, QueuedJobArtifact, RebuildArtifactReport, RebuildReport,
| ^^^^^^^^^ ^^^^^^^^^^^^^
|
= note: `#[warn(unused_imports)]` on by default
warning: unused variable: `sync`
--> tools/src/main.rs:179:42
|
179 | SubCommand::Pkgs(Pkgs::SyncStdin(sync)) => {
| ^^^^ help: if this is intentional, prefix it with an underscore: `_sync`
|
= note: `#[warn(unused_variables)]` on by default
warning: `rebuilderd-worker` (bin "rebuilderd-worker") generated 1 warning (run `cargo fix --bin "rebuilderd-worker"` to apply 1 suggestion)
warning: `rebuildctl` (bin "rebuildctl") generated 1 warning
Compiling rebuilderd-tests v0.23.1 (/home/holger/rebuilderd/tests)
Finished `release` profile [optimized] target(s) in 2m 44s
Unfortunately we hit this:
Failed to run pending migration: Failed to run 2025-06-01-095202_redesign_database with: NOT NULL constraint failed: rebuilds.build_log
For the migration testing in Debian, what I'm interested in is a listing of all unique architecture/package_name/version BinaryPackages in a set of releases. I don't need the endpoint to distinguish the releases (as most packages in unstable and testing will be the same). So very similar to `v1/packages/binary?release=`, but with support for more than one `release` per URL (so `v1/packages/binary?releases=unstable,forky`), and I don't need the duplicate BinaryPackage per release if the version hasn't changed.
Something like the deduplicated union of `v1/packages/binary?releases=unstable` and `v1/packages/binary?releases=forky`, where the info of BinaryPackage is reduced to architecture/binary_package_name/version.
I know this is rather geared towards the use of Debian's britney2, but I'll be querying this endpoint once per hour, so I think it might be worth a bit of optimization. Or should all Debian rebuilders take care of this from their side? (I envision I'll be querying multiple rebuilders at some point in the future, to check whether the (trusted) rebuilders agree.)
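To make the request concrete, here is a small client-side sketch of the reduction being asked for. It assumes each release's listing has already been fetched and decoded into a list of dicts; the field names `architecture`, `name`, and `version`, the function name, and the sample data are all hypothetical and not taken from the proposed v1 schema.

```python
from typing import Dict, List, Tuple

Package = Dict[str, str]
Triple = Tuple[str, str, str]


def unique_binary_packages(per_release: Dict[str, List[Package]]) -> List[Triple]:
    """Merge per-release listings into sorted, unique (architecture, name, version) triples."""
    seen = set()
    for packages in per_release.values():
        for pkg in packages:
            # The release itself is deliberately dropped from the key.
            seen.add((pkg["architecture"], pkg["name"], pkg["version"]))
    return sorted(seen)


# Made-up sample data: "hello" is identical in both releases, "foo" differs.
SAMPLE = {
    "unstable": [
        {"architecture": "amd64", "name": "hello", "version": "2.10-3"},
        {"architecture": "amd64", "name": "foo", "version": "1.0-2"},
    ],
    "forky": [
        {"architecture": "amd64", "name": "hello", "version": "2.10-3"},
        {"architecture": "amd64", "name": "foo", "version": "1.0-1"},
    ],
}
```

With `SAMPLE`, `unique_binary_packages(SAMPLE)` yields three triples: `hello` appears once, while both versions of `foo` are kept. Doing this server-side would avoid transferring the duplicates on every hourly poll.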
Adding in support for querying multiple releases in a single API call should be pretty easy. A unique-only query parameter should also be possible, but the behaviour of which entry in the DB is returned might not be super intuitive.
I've updated the schema to push blobs out into their own tables and edited the proposed logic to copy build results to newly-registered packages for identical inputs on the same distribution. Additionally, incoming results will be copied to other releases with the same build input as well.
We've been running this rework successfully on Debian's side for a few days now and I believe we've ironed out most of the kinks. The PR is ready for full review and I'm happy to make any required changes.
Currently the new API only allows filtering by source package name, can you also add an option to filter by binary package name?
As mentioned on https://github.com/kpcyrd/debian-repro-status/issues/19, the missing v0 API causes reproduce.debian.net to not work with the debian-repro-status shipped with Trixie/Stable. This is unfortunate, since the release notes encourage users to check the reproducibility of their systems with it. I also filed a bug on the Debian package and we already discussed this at #debian-reproduce. As far as I can see, the consensus was that the best course of action would be to re-enable the v0 API so debian-repro-status works again. So imho this merge request should be changed to include the v0 API again so as not to break the Trixie package.
I think the DB design is in a stable enough state to bring the v0 API back in a read-only form for backwards compatibility - let me see what I can do! I'm traveling at the moment so I might not be able to get it done before Vienna.
kpcyrd also said on IRC yesterday that he considers a working v0 API a prerequisite for merging this. (He won't merge if the API doesn't exist.)
He also apologized for not stating this earlier. :)
And I'm too lazy to provide proper quotes. If I quoted kp wrong, I'm sure he will correct me here. :)
Thanks for bringing back the v0 API! Currently debian-repro-status in forky reports BAD for gcc-15-base, libgcc-s1, libpam-runtime and libstdc++6 even though they are GOOD when looking them up on the website. Can you have a look?
Will check! Probably an ordering issue.
v0 appears to be working fine in read-only mode now.
I am trying to update my Debian 13 rebuilderd db to this branch, but it failed at migration:
daemon-1 | Error: Failed to run pending migration: Failed to run 2025-06-01-095202_redesign_database with: UNIQUE constraint failed: index 'source_packages_unique_idx'
pkgbases:
id=6 | name=389-ds-base | version=3.1.2+dfsg1-1 | distro=debian | suite=main | architecture=amd64
id=19889 | name=389-ds-base | version=3.1.2+dfsg1-1 | distro=debian | suite=main | architecture=all
The migration performs: INSERT INTO source_packages(id, name, version, distribution, component) SELECT id, name, version, distro, suite FROM pkgbases;
This copies both entries with their original IDs, but the new unique index requires: CREATE UNIQUE INDEX source_packages_unique_idx ON source_packages ( name, version, distribution, COALESCE("release", 'PLACEHOLDER'), COALESCE(component, 'PLACEHOLDER') );
Since the architecture column is not included in the INSERT, this creates duplicates that violate the unique constraint.
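The failure can be reproduced in isolation with the two rows and the index quoted above. The sketch below uses simplified, assumed table definitions, and the `GROUP BY` variant is only one possible way to collapse the duplicates, not necessarily the fix applied in this PR. Anything that references the dropped id (19889 here) would still need to be remapped to the surviving row.

```python
import sqlite3

# Assumed, simplified table definitions; rows and index are from the report above.
SCHEMA = """
CREATE TABLE pkgbases (
    id INTEGER PRIMARY KEY, name TEXT, version TEXT,
    distro TEXT, suite TEXT, architecture TEXT
);
INSERT INTO pkgbases VALUES
    (6, '389-ds-base', '3.1.2+dfsg1-1', 'debian', 'main', 'amd64'),
    (19889, '389-ds-base', '3.1.2+dfsg1-1', 'debian', 'main', 'all');
CREATE TABLE source_packages (
    id INTEGER PRIMARY KEY, name TEXT NOT NULL, version TEXT NOT NULL,
    distribution TEXT NOT NULL, "release" TEXT, component TEXT
);
CREATE UNIQUE INDEX source_packages_unique_idx ON source_packages (
    name, version, distribution,
    COALESCE("release", 'PLACEHOLDER'), COALESCE(component, 'PLACEHOLDER')
);
"""

# The statement quoted from the migration.
NAIVE = (
    "INSERT INTO source_packages(id, name, version, distribution, component) "
    "SELECT id, name, version, distro, suite FROM pkgbases"
)

# Hypothetical alternative: keep one row (lowest id) per identity.
DEDUPED = (
    "INSERT INTO source_packages(id, name, version, distribution, component) "
    "SELECT MIN(id), name, version, distro, suite FROM pkgbases "
    "GROUP BY name, version, distro, suite"
)


def build_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def naive_fails() -> bool:
    """Return True if the original INSERT trips the unique index."""
    conn = build_db()
    try:
        conn.execute(NAIVE)
    except sqlite3.IntegrityError:
        return True
    return False


def deduped_rows() -> list:
    """Run the deduplicating INSERT and return the resulting rows."""
    conn = build_db()
    conn.execute(DEDUPED)
    return conn.execute("SELECT id, name FROM source_packages").fetchall()
```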
@cen1 Thanks for the report! So far all test databases have been single-architecture, but it should definitely be a supported migration path either way. Will put it on the list for hacking at the conference!
@cen1 Poking at this now - do you have a way to share your database? I'm curious as to how you ended up with an entry for 389-ds-base and an architecture of all - 389-ds-base is an any package and should not have been part of an all sync.
EDIT: Okay, now I see - it's because of python3-lib389 and cockpit-389-ds. Makes more sense now - I'll keep digging, but I could still use a copy of the problematic DB to assist!
Here you go: https://debian-rb.xpam.pl:2096/rebuilderd_stats/rebuilderd.db
@Nihlus it was great to meet you and discuss this. Just as a reminder, here is my wishlist ;):
- support manual scheduling again. Currently bad and failed packages are immediately added again to the queue so we can't schedule them manually. Would be great to somehow get them to the top of the queue and without the additional `next_retry` delay.
- block specific versions of packages that will never reproduce from being rescheduled again and again.
- support delaying new entries from the `pkgs sync` to make sure all needed files reached the right servers.
- support an API endpoint that takes a source package name and version or release and gives back the state of all binary packages.
- support udebs.
Thanks!
Multi-architecture migration has been fixed and I am now ready to merge. @jspricke, thanks for capturing the wishlist! I'll start checking stuff off once we have the PR merged and can start doing more targeted features/fixes.