Support FoundationDB for consensus
Implements a consensus backend to talk to FoundationDB. Very much untested and still panics sometimes.
Run: bin/environmentd --optimized --consensus foundationdb:
TODO to make this work everywhere:
- Make building with FoundationDB optional. Currently, there is no easy way to get a libfdb_c on aarch64 Mac.
- This change adds the fdb library to the base images, which isn't great if we don't want to use it.
- The test infrastructure distinguishes metadata stores in different ways, sometimes with a string and sometimes with a bool, to switch between the built-in postgres and external crdb. It'd be nice to change this to an enum of Internal, External-CRDB, External-FDB to allow easy switching between different implementations and targets.
- Testdrive passes most tests, but the consistency and tombstone checks don't work as it assumes an incorrect store.
- Initializing FDB requires an fdbcli call to create the database. Before that, we cannot establish a connection. If we just use the provided image, we need to slot that call somewhere outside of docker-compose itself. See https://github.com/apple/foundationdb/blob/main/packaging/docker/samples/local/start.bash for context.
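A rough sketch of the enum proposed in the list above, to make the idea concrete. Everything here is hypothetical: the name `MetadataStore`, the string values, and the `from_legacy_flag` helper are illustrative assumptions, not existing test-infrastructure code.

```python
from enum import Enum


class MetadataStore(Enum):
    """Hypothetical replacement for the string/bool switches in the test infra."""

    INTERNAL = "internal"            # built-in postgres
    EXTERNAL_CRDB = "external-crdb"  # external CockroachDB
    EXTERNAL_FDB = "external-fdb"    # external FoundationDB

    @property
    def is_external(self) -> bool:
        # Anything other than the built-in store needs its own service.
        return self is not MetadataStore.INTERNAL

    @classmethod
    def from_legacy_flag(cls, external: bool) -> "MetadataStore":
        # Assumed mapping for the old bool: True meant external crdb,
        # False meant the built-in postgres.
        return cls.EXTERNAL_CRDB if external else cls.INTERNAL


# Parsing a string (e.g. from a CLI option) goes through the enum value.
store = MetadataStore("external-fdb")
```

With something like this, call sites could match on one value instead of combining a string and a bool, and adding another backend would be a one-line change.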
Still losing data somewhere, it's getting closer though:
2025-10-10T15:18:10.342212Z thread 'persist:001f' panicked at /home/moritz/dev/repos/materialize/src/persist-client/src/internal/state_versions.rs:771:9:
assertion `left == right` failed
left: Some(SeqNo(254))
right: Some(SeqNo(257))
Tell me if you want any help with setting up FoundationDB in mzcompose. I'm very interested to see benchmark results, as well as the limits test to find new limits (with the artificial limits removed that it currently has because things become too slow).
I added a testdrive variant that runs against FoundationDB, but it requires a bunch of changes to make Mz compile in docker. One issue is that the FoundationDB client library needs to be dynamically linked, which is a novel problem for us. At the moment, everything in Materialize is statically linked, so we need to make sure that our base images contain the right library for the compile and runtime to be happy.
Feature Benchmark looks pretty similar, ~2% slower on average: https://docs.google.com/spreadsheets/d/1iC-gxHKOgz-kkQDKgsq_aem-Y_8cVMZaeBDqT1sR5JE/edit?gid=2146535294#gid=2146535294
But it mostly doesn't benchmark DDLs.
Scalability Benchmark showed a few slight regressions: https://buildkite.com/materialize/nightly/builds/13764
Parallel Benchmark has INSERTs being slower: https://buildkite.com/materialize/nightly/builds/13766
The limits test has some Pg connections being closed for an unknown reason: https://buildkite.com/materialize/release-qualification/builds/969
Maybe the rest will have some interesting results for whether we can have more objects using FDB.
The feature benchmark had a few small regressions in SmallInserts and Subscribes: https://buildkite.com/materialize/nightly/builds/13768
When enabling FoundationDB consensus in Parallel Workload with 10x the number of objects (to stress it a bit more), I'm seeing a novel panic: Parallel Workload (0dt deploy)
parallel-workload-materialized-1 | thread 'tokio:work-2' panicked at /var/lib/buildkite-agent/builds/buildkite-l-builders-x86-64-static-4e3f139-i-0d002e61edf47c1a8-1/materialize/test/src/storage-controller/src/collection_mgmt.rs:1186:21: error truncating metrics history: appending retractions: UpperMismatch { expected: Antichain { elements: [1761082474918] }, current: Antichain { elements: [1761082475918] } } (type=WallclockLagHistory)
Could it be related to FoundationDB? If not I'll open a separate issue.
Edit: I couldn't reproduce it without FoundationDB, tried in https://github.com/MaterializeInc/materialize/pull/33907
Edit2: I have opened an issue: https://github.com/MaterializeInc/database-issues/issues/9824