
A Bloom filter to speed up all 404s

Open roberth opened this issue 6 months ago • 17 comments

Is your feature request related to a problem? Please describe.

Binary caches impose a hidden cost on virtually all uncached builds, because we wait for 404 responses before starting the build. (see alternative 1)

Describe the solution you'd like

Publish a Bloom filter in the binary cache. This could take the form of a regularly updated snapshot, plus a sequence of "patch" files so that all copies can be kept up to date frequently. Depending on the parameters, the Bloom filter would fit comfortably in 2 GB, given the size of the bucket. For the updates, it's more efficient to send the store path hashes than the indices of the Bloom filter bits, if I understand correctly, but as I may be ignorant, I'll just say that the upper bound for the updates is 20 bytes per store path. This suggests that a Bloom filter could be synced world-wide onto

  • ideally, nodes in our current CDN
  • alternatively, a set of our own hosts in various locations
  • anyone who wants to have a very fast cache, e.g. a local reverse proxy or a caching proxy on the local network

Note that the Bloom filter will tell us that either

  • the store object certainly does not exist
  • the store object may exist (with a high probability, assuming suitable parameters)

This means that the probabilistic nature of it does not pose a problem for returning 404s instantly. When the Bloom filter does "fail" (a collision), it means that we query whatever we query today (e.g. the S3 bucket) and still get a 404. Depending on the Bloom filter parameters, this is rare. This situation can also be created intentionally, by removing a store object that is still recorded in the Bloom filter. Alternatively, this can be considered an acceptable cost, e.g. when performing a garbage collection but not replacing the Bloom filter, if the number of deletions is insignificant.
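For illustration, a minimal sketch of that lookup path, assuming a hypothetical filter object with a `might_contain` method and an S3-like `get` (names are made up, not an existing API):

```python
# Hypothetical fast path for narinfo lookups, keyed on the 32-character
# store path hash (the part before the first "-" in a store path).

def lookup_narinfo(store_path_hash: str, bloom, s3_bucket):
    """Return the narinfo contents, or None for a definite 404."""
    if not bloom.might_contain(store_path_hash):
        # The filter has no false negatives, so we can answer 404 immediately,
        # without ever talking to S3.
        return None
    # "Maybe present": fall back to the authoritative store. A false positive
    # (or a deleted object still recorded in the filter) still ends in a 404,
    # just via the slow path we use today.
    return s3_bucket.get(f"{store_path_hash}.narinfo")
```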

Describe alternatives you've considered

  1. We could start builds speculatively, but only if all dependencies are already available. Otherwise we're making things way worse for the 200 OK case, where we want to avoid fetching derivation inputs that aren't part of the output.

  2. Instead of a Bloom filter, or in addition, we could publish the list of cached store paths for download. This is far less efficient, but possibly useful for other purposes.

  3. Alternatives exist for the Bloom filter data structure. I have not analyzed the differences. Note that incremental updates are important for delivering new store paths quickly and efficiently.

Additional context

  • I'm probably not the first to advocate for this idea, but I have no idea who to credit besides @edef1c, with whom I've chatted about this some time ago.
  • This could replace or supplement the easier partial solution #769

roberth avatar Jul 14 '25 23:07 roberth

Can we somehow estimate this (hidden) cost? I'd say that cheap builds shouldn't get looked up (they should set allowSubstitutes = false), so we'd be speeding up bigger builds that take nontrivial resources – and the best we can do is to start them earlier by... a fraction of a second per build?

vcunat avatar Jul 15 '25 05:07 vcunat

Nit: (3.) alternatives might be worth considering to reduce the lookup locations, even though they tend to be more complex than Bloom. Classic Bloom filters look up several (about log2 (1/failProbability)) randomly distributed bits in the huge bitmap. Some alternatives, e.g. Cuckoo filters, probe just two segments of several bytes on a lookup. That's in practice much cheaper both locally (memory hierarchies) and when querying over the network.
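For intuition, the textbook sizing formulas tie the probe count to the false-positive rate and to the earlier 2 GB figure; the object count below is an assumed round number, not a measurement of the actual bucket:

```python
import math

def bloom_parameters(n_items: int, false_positive_rate: float):
    """Textbook optimal Bloom filter sizing: m bits total, k hash probes per lookup."""
    m = math.ceil(-n_items * math.log(false_positive_rate) / (math.log(2) ** 2))
    k = max(1, round((m / n_items) * math.log(2)))  # roughly log2(1/p)
    return m, k

# Illustrative only: assume ~1e9 store paths and a 0.1% false-positive rate.
m, k = bloom_parameters(1_000_000_000, 0.001)
print(f"{m / 8 / 1e9:.1f} GB, {k} bit probes per lookup")  # ~1.8 GB, 10 probes
```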

vcunat avatar Jul 15 '25 05:07 vcunat

That's also something I've been thinking about in relationship to https://github.com/numtide/nixos-passthru-cache

One consideration is that downloading 2GB is not nothing. The time saved by the faster 404 queries has to outweigh it, as vcunat said. Combined with the fact that you don't see new narinfos until a new Bloom filter snapshot is published, the window of benefit is probably quite small. If we're not careful, we end up with a lot more egress bandwidth.

Part of the answer might be to use https://roaringbitmap.org/, which is supposed to compress better.

Another possibility is to add a stream of new narinfos the caches can subscribe to, and update their local index with.

zimbatm avatar Jul 15 '25 09:07 zimbatm

2GB: if someone is only interested in binaries for a particular nixpkgs commit, our channels even publish a compressed list of all store paths, 9 MiB or just 5 MiB if you're linux-only.

The bigger one has 368k paths right now, so an approach like a Bloom filter could compress it further, though for reasonable reliability it has to take megabits for this set size (information-theoretical lower bound). As some middle ground, we might similarly build a filter that spans only recent-ish commits.

EDIT: I should explicitly say that you can do such savings only at the cost of false negatives for non-recent binaries, i.e. with an old nixpkgs commit you would end up building stuff that we do have in cache.
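A rough check of the "megabits" claim, with an arbitrarily chosen false-positive rate (the channel list itself is exact; any filter is lossy):

```python
import math

n, p = 368_000, 1e-4                           # path count from above, p chosen arbitrarily
lower_bound_bits = n * math.log2(1 / p)        # information-theoretic minimum
bloom_bits = lower_bound_bits / math.log(2)    # a plain Bloom filter needs ~1.44x that
print(f"{lower_bound_bits / 1e6:.1f} Mbit minimum, "
      f"~{bloom_bits / 8 / 1e6:.1f} MB as a Bloom filter")
# -> 4.9 Mbit minimum, ~0.9 MB as a Bloom filter
```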

vcunat avatar Jul 15 '25 10:07 vcunat

downloading 2GB is not nothing

The way I see it, the primary user of the bloom filter would be the cache.nixos.org internals. Providing this data for others could be seen as a separate goal, but an important one nonetheless.

"Streaming" the new entries is a must have. Doesn't have to be instant, but if builds become available quickly, that removes a worry, and having updates at all removes the need for the scary-ish 2GB download over and over.

I don't have good numbers, but overestimating(?) the number of new store paths at 25k per day, we're looking at only 0.5MB/day of store path hashes to send out.
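As a sketch of what consuming such a stream could look like on a replica — the fixed 20-byte record framing is an assumption made up for illustration; only the bytes-per-path figure comes from above:

```python
RECORD_SIZE = 20  # one 160-bit store path hash per record (assumed framing)

def apply_hash_log(bloom, stream) -> int:
    """Insert every store path hash read from `stream` into the local filter."""
    count = 0
    while record := stream.read(RECORD_SIZE):
        bloom.add(record)  # hypothetical filter API
        count += 1
    return count
```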

(they should set allowSubstitutes = false)

It depends. Some small file-generating derivations still pull in a linter, which could be wasteful.

As some middle ground, we might similarly build a filter that spans only recent-ish commits.

This loses the ability to confidently provide a 404, and that's the purpose of this issue. To make something like that work, we'd need a lot of user-facing complexity about which Nixpkgs "generation" it is (for lack of an unambiguous word), and it'd become harder to combine multiple Nixpkgs versions in a single usage.

The binary cache mechanism should remain as transparent as possible.

Cuckoo

https://roaringbitmap.org/

Yes, that could be a better alternative. We could also make the raw store hashes available, which I believe is on the order of 10GB, so that anyone can play around with it.

roberth avatar Jul 15 '25 11:07 roberth

I don't have good numbers, but overestimating(?) the number of new store paths at 25k per day, we're looking at only 0.5MB/day of store path hashes to send out.

It'd be worth having someone with access confirm this number. Just from some napkin math of rough statistics from hydra.nixos.org, I wouldn't be surprised if we were over 115k/day

grahamc avatar Jul 15 '25 11:07 grahamc

Hydra's metrics show a speed of about one build step per second (and it's been so for a long time), but I don't have numbers for the new store path counts.

The way I see it, the primary user of the bloom filter would be the cache.nixos.org internals.

I'm puzzled about that part. I know about plans for putting a layer between S3 and Fastly, but I think that one would know all the *.narinfo files, not just set membership.

vcunat avatar Jul 15 '25 11:07 vcunat

Hydra's metrics show a speed of about one build step per second (and it's been so for a long time), but I don't have numbers for the new store path counts.

I did this by sampling https://hydra.nixos.org/steps a few times and seeing at least 40 per minute, sometimes 80, and extrapolating from the queue not looking too long.

grahamc avatar Jul 15 '25 11:07 grahamc

Thanks! Looks like my napkin estimate was off by quite a bit there. Let's assume it's 1 build per second, some multi-output, so let's say 3 outputs per second. At 20 bytes per output (the hash part), we have the following pace for the hash log (let's call it that):

  • 60 B/s
  • 216 kB/h
  • 5.1 MB/day
  • 148 MB/month

So the amount of data to stream out to users is negligible. Also we don't need to persist this for a long time, because at some point it exceeds the size of the snapshot, in which case it's better to just get that. Furthermore these numbers pale in comparison to e.g. the Hydra database, which requires far more than 20 bytes per output.

So to summarize the infrastructural costs:

  • 2 GB of memory per replica
  • 2 GB on disk + perhaps a backup
  • Insignificant traffic to keep the filter up to date for NixOS's own replicas and for anyone who wants to host their own

Benefits:

  • Faster queries when building your own stuff: development, NixOS config files, etc.
  • Incentive to host a caching proxy, because that's even faster. This might outweigh the cost; thanks to the latter, it might even reduce the load on the infrastructure. That's hard to quantify though.

roberth avatar Jul 15 '25 13:07 roberth

Yep, this is a great idea. I was literally about to open a ticket for this this morning, as @edef1c noted that #769 will most likely have such a low hit rate that in practice it will not change much: the chance that people fetch the same 404 narinfos is quite small, as the keyspace is the size of our hashes...

Thanks for starting the discussion.

Fastly does support arbitrary edge compute in various languages (https://www.fastly.com/products/edge-compute)

We might be able to host the 404 bloom filter check code on there and have the backing bloom filter data be distributed around their various POPs automatically.

arianvp avatar Jul 16 '25 07:07 arianvp

Also to further contextualize: we're currently spending around $600 per month on API usage just for serving 404s from S3.

Though not the lowest hanging fruit for cost saving, it is definitely a non-trivial amount.

arianvp avatar Jul 16 '25 07:07 arianvp

Also note that Bloom Filters (and its cousins) form a semilattice (or the new hip term: it's a CRDT). You can take a big bloom filter and a small bloom filter and merge them to get the combined bloom filter. And the order of merging doesn't matter for the end result. This makes it very easy to sync around compressed incremental updates with eventual consistency. We could e.g. publish a bloom filter for each channel bump and we can merge them into the total bloom filter to get the full picture.
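Concretely, for two filters built with identical parameters (same size, same hash functions), the merge is just a bitwise OR, which is commutative, associative and idempotent:

```python
def merge(filter_a: bytearray, filter_b: bytearray) -> bytearray:
    """Union of two Bloom filters that share size and hash functions."""
    assert len(filter_a) == len(filter_b), "filters must share parameters"
    return bytearray(a | b for a, b in zip(filter_a, filter_b))

# e.g. merging per-channel-bump filters into the total filter, in any order,
# any number of times, always yields the same result.
```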

arianvp avatar Jul 16 '25 09:07 arianvp

One more consideration is that read-after-write consistency for cache.nixos.org is important in our current setup. The builders are also pulling from it. Typically they pull dependencies that have just been built by another builder. We could decide to re-configure the builders to pull directly from S3.

Probably the next best step is to sanitize the Fastly logs from PII and publish them somewhere. Then we can all take a look and make more educated guesses on whether the extra complexity is worth it.

zimbatm avatar Jul 16 '25 10:07 zimbatm

I don't really see how a bloom filter changes anything regarding that? If a bloom filter reports a 404 we know for certain the path isn't there. If a bloom filter reports a 200 and fastly produces a cache miss, we fall back to S3 to serve either a 200 or a 404 with read-after-write consistency.

It's just important that the bloom filter gets updated after a write to S3 occurs and not before, and then I don't think anything changes about the consistency guarantees we have today. In the best case you avoid hitting S3; in the worst case you still hit S3, but read-after-write consistency is maintained in both cases.

On the other hand, with my other proposal #769 there are issues with read-after-write consistency, as the cache purge happens asynchronously after the upload to S3.

arianvp avatar Jul 16 '25 10:07 arianvp

It's more of a consideration for the implementation. To provide strong read-after-write guarantees it means that the bloom filter needs to be updated somewhere before Hydra finishes the upload and moves to the next build.

zimbatm avatar Jul 16 '25 10:07 zimbatm

It's more of a consideration for the implementation. To provide strong read-after-write guarantees it means that the bloom filter needs to be updated somewhere before Hydra finishes the upload and moves to the next build.

Oh yes, I'm sorry, you're right. The scenario is the same both for caching 404s and the bloom filter. We indeed need to finish updating the bloom filter before considering a path "uploaded" to maintain read-after-write consistency.
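A minimal sketch of that ordering, with simplified object keys and hypothetical helper objects:

```python
def publish_store_path(store_path_hash: str, narinfo: bytes, nar: bytes,
                       s3_bucket, bloom) -> None:
    """Upload a path and only then record it, so a later lookup never sees a
    false 'definitely absent' for something already reported as uploaded."""
    s3_bucket.put(f"nar/{store_path_hash}.nar.xz", nar)   # 1. upload contents (key simplified)
    s3_bucket.put(f"{store_path_hash}.narinfo", narinfo)  # 2. upload metadata
    bloom.add(store_path_hash)                            # 3. record in the filter
    # Only after step 3 may Hydra consider the build "uploaded" and move on.
```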

arianvp avatar Jul 16 '25 12:07 arianvp

Probably the next best step is to sanitize the Fastly logs from PII and publish them somewhere. Then we can all take a look and make more educated guesses on whether the extra complexity is worth it.

I can already give some useful numbers from the fastly metrics and S3 billing:

In the past month we've served 2 billion 404 requests. That's 26.44% of all the requests that Fastly serves.

The average miss time over the past month was 92ms

Past month we spent $570.99 on API cost for GET requests (those are called Tier2 requests in S3)

Interestingly S3 only has 1.4 billion requests served whilst Fastly is reporting 2 billion 404 requests.

Not fully sure where this discrepancy is coming from but it's in the same ballpark at least.

arianvp avatar Jul 17 '25 10:07 arianvp