
feature request: dynamodb cache backend


One of the best things about bazel-remote is that it gives you an ~infinitely scalable remote cache with ~zero operational overhead if you use S3 as the backend. At my previous job, we rarely had to think about bazel-remote despite having a very large CI cluster. This is how we used it for many years:

*(diagram: CI machines → bazel-remote with a local disk cache → S3)*

The problem with this setup is that reading objects from S3 is slow due to the high round-trip time (RTT). This can be tolerable in small repositories (since N fetches can happen in parallel), or if the disk cache is warm, but it can be a killer in large repos with long chains of dependencies, or when the disk cache is cold. At my previous job, there were instances where a cold build would have been faster than using the remote cache.

RTT could be reduced by inserting a second tier of machines running bazel-remote between the CI machines and S3, but this would (IMO) increase the operational overhead to the point where a remote execution (RE) cluster like buildbarn starts to look more appealing.

My suggestion: introduce support for using DynamoDB (DDB) as a cache backend. Cache entries (AC and CAS) can be retrieved from DDB with far lower latency than from S3: fetches within the same AWS region can be as much as 20x faster (roughly 20ms for S3 vs 1ms for DDB). This results in much faster fully- and partially-cached builds. Setting up DDB as a cache backend is as simple as creating a new table with the correct partition key and setting a config flag (sketched below).

*(diagram: proposed setup with DynamoDB as the cache backend)*
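To make the setup cost concrete, here is a rough sketch of creating such a table with aws-sdk-go-v2. The `ObjectKey` partition key matches the entries shown below; the table name, the on-demand billing mode, and all of the wiring are my assumptions, not something the prototype prescribes.

```go
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb"
	ddbtypes "github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
)

func main() {
	cfg, err := config.LoadDefaultConfig(context.Background())
	if err != nil {
		log.Fatal(err)
	}
	ddb := dynamodb.NewFromConfig(cfg)

	// Single-attribute key schema: the partition key is the cache key
	// (e.g. "ac#<hash>"). The table name here is hypothetical.
	_, err = ddb.CreateTable(context.Background(), &dynamodb.CreateTableInput{
		TableName:   aws.String("bazel-remote-cache"),
		BillingMode: ddbtypes.BillingModePayPerRequest, // on-demand: no capacity planning
		AttributeDefinitions: []ddbtypes.AttributeDefinition{{
			AttributeName: aws.String("ObjectKey"),
			AttributeType: ddbtypes.ScalarAttributeTypeS,
		}},
		KeySchema: []ddbtypes.KeySchemaElement{{
			AttributeName: aws.String("ObjectKey"),
			KeyType:       ddbtypes.KeyTypeHash,
		}},
	})
	if err != nil {
		log.Fatal(err)
	}
}
```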

One important limitation of DDB is its maximum item size of 400 KB; S3 has no comparable limit. We can work around this by writing large blobs to S3 and writing small “pointer” items to DynamoDB, while blobs under the limit are written directly to DDB.

An "inlined" entry might look like this:

```json
{
  "ObjectKey": "ac#abc123",
  "data": "<binary>",
  "s3Uri": null
}
```

And a "pointer" entry like this:

```json
{
  "ObjectKey": "ac#abc123",
  "data": null,
  "s3Uri": "s3://bucket-name/bazel-remote/ac/ab/c123"
}
```
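For concreteness, here is a minimal sketch of what the read path could look like with aws-sdk-go-v2. The attribute names match the entries above; the function names, error handling, and S3 URI parsing are my own assumptions about how this could look, not necessarily how the prototype implements it.

```go
package ddbcache

import (
	"context"
	"errors"
	"fmt"
	"io"
	"strings"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb"
	ddbtypes "github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

var ErrNotFound = errors.New("blob not found")

// Get fetches a cache entry by key (e.g. "ac#abc123"). Small blobs come back
// inlined in the DynamoDB item; large blobs are fetched from S3 via a pointer.
func Get(ctx context.Context, ddb *dynamodb.Client, s3c *s3.Client, table, key string) ([]byte, error) {
	out, err := ddb.GetItem(ctx, &dynamodb.GetItemInput{
		TableName: aws.String(table),
		Key: map[string]ddbtypes.AttributeValue{
			"ObjectKey": &ddbtypes.AttributeValueMemberS{Value: key},
		},
	})
	if err != nil {
		return nil, err
	}
	if out.Item == nil {
		return nil, ErrNotFound // cache miss
	}
	// Inlined entry: the blob itself lives in the item.
	if b, ok := out.Item["data"].(*ddbtypes.AttributeValueMemberB); ok {
		return b.Value, nil
	}
	// Pointer entry: the item only records where the blob lives in S3.
	if s, ok := out.Item["s3Uri"].(*ddbtypes.AttributeValueMemberS); ok {
		bucket, objKey, err := parseS3URI(s.Value)
		if err != nil {
			return nil, err
		}
		obj, err := s3c.GetObject(ctx, &s3.GetObjectInput{
			Bucket: aws.String(bucket),
			Key:    aws.String(objKey),
		})
		if err != nil {
			return nil, err
		}
		defer obj.Body.Close()
		return io.ReadAll(obj.Body)
	}
	return nil, fmt.Errorf("item %q has neither data nor s3Uri", key)
}

// parseS3URI splits "s3://bucket/path/to/key" into bucket and key.
func parseS3URI(uri string) (bucket, key string, err error) {
	rest, ok := strings.CutPrefix(uri, "s3://")
	if !ok {
		return "", "", fmt.Errorf("not an s3 uri: %q", uri)
	}
	bucket, key, ok = strings.Cut(rest, "/")
	if !ok {
		return "", "", fmt.Errorf("missing key in s3 uri: %q", uri)
	}
	return bucket, key, nil
}
```

The write path would be the mirror image: blobs over the 400 KB limit go to S3 followed by a pointer PutItem, while anything smaller goes straight into the item with a single PutItem.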

Results

I have created a prototype here: https://github.com/christianscott/bazel-remote/commits/dynamodb-proxy

I tested two projects. The first was a simulated worst-case “dummy” project (see below): a chain of tiny genrule() targets that depend serially on one another. The second was the bazelbuild/bazel source code. I tested both from my laptop, with S3/DDB in the nearest AWS region (ap-southeast-2, 8.5ms away), and on an EC2 instance within the same datacenter. To take these measurements I first performed a cold build to populate the remote cache, then cleared both bazel's and bazel-remote's disk caches, then re-ran the build; the second build is the one I measured.

Fully cached builds with the DDB-backed cache were 2-5x faster than with the current S3-backed implementation: roughly 2x faster on my laptop, and 2.5-5x faster on EC2. This difference is explained by the latency between my laptop and ap-southeast-2: moving onto EC2, the average RTT of calls to DDB drops from around 17ms to just over 1ms.

| Project | Env | Cache impl | Avg. latency (ms) | Build duration (s) | Speedup vs baseline |
|---|---|---|---|---|---|
| dummy | Laptop | S3 | 27.76 | 65.703 | - |
| dummy | Laptop | DDB | 9.68 | 24.177 | 2.7x |
| dummy | EC2 | S3 | 20.49 | 46.402 | - |
| dummy | EC2 | DDB | 1.22 | 8.794 | 5.3x |
| bazelbuild/bazel | Laptop | S3 | 35.61 | 56.946 | - |
| bazelbuild/bazel | Laptop | DDB | 17.84 | 27.044 | 2.1x |
| bazelbuild/bazel | EC2 | S3 | 15.90 | 20.138 | - |
| bazelbuild/bazel | EC2 | DDB | 2.34 | 8.189 | 2.5x |

I have uploaded the raw data here: https://gist.github.com/christianscott/89dab76596c648e8ff5b6aca5ecaed94


The `BUILD.bazel` file for the dummy project. Build via `bazel build :mk999`.

```python
[
    genrule(
        name = "mk{}".format(i),
        outs = ["hello{}.txt".format(i)],
        cmd = "echo hello from {} > $@".format(i),
        srcs = ["hello{}.txt".format(i - 1)] if i > 0 else [],
    )
    for i in range(1000)
]
```
