
Prevent parallel jobs from overwriting the same s3 object when saving the cache


Suppose we have N jobs running at the same time with the same cache key and no cache saved yet. The job that finishes last will overwrite the cache saved by the ones that finished before it.

job_a [0% -----------------------100%] -> cache (will overwrite cache saved from b and c)
job_b       [0% ------------100%] -> cache (will overwrite cache saved from c)
job_c    [0% -------- 100%] -> cache

It seems like the plugin only checks for the existing S3 object in restore() but not in cache().
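
For illustration, a minimal sketch of what such a guard in cache() could look like, written in Python with boto3 rather than the plugin's actual Bash; the bucket, key, and archive path are placeholder names:

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def cache_exists(bucket: str, key: str) -> bool:
    """Return True if an object already exists under this cache key."""
    try:
        s3.head_object(Bucket=bucket, Key=key)
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "404":
            return False
        raise

def save_cache(bucket: str, key: str, archive_path: str) -> None:
    # Skip the upload entirely if a faster job has already published this key.
    if cache_exists(bucket, key):
        print(f"cache already exists for {key}, skipping upload")
        return
    s3.upload_file(archive_path, bucket, key)
```

This narrows the race window but does not close it: two jobs that check at the same moment will both upload. The comments below discuss the remaining race.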

— ghost, Jul 17 '23

Just to follow up on this issue: the problem is that a cache which already exists gets rewritten multiple times.

A simple use case is a node_modules cache for a large web project. These caches are usually heavy (e.g. 500 MB+) and needed for every check in the project (e.g. test / lint / prettier), but when the key changes, none of the jobs see an old cache, so each one installs new dependencies and then uploads the cache. The upload step is usually much slower than a download, and it adds 3-5 minutes to every job in a PR even though the cache was already saved by the fastest one.

Ideally the plugin should check whether a cache under the key already exists before saving. In the analogous case, the GitHub Actions cache implementation saves the cache from the fastest job and bails out on the others with this error:

Unable to reserve cache with key ${key}, another job may be creating this cache

— kliakhovskii-brex, Aug 24 '23

I'm looking into possible solutions for this. I haven't started on the code yet, but I'm considering a few options.

— gencer, Aug 24 '23

Note: the S3 key is not visible until the object is completely uploaded and the request has finished.

— gencer, Aug 24 '23

> Note: the S3 key is not visible until the object is completely uploaded and the request has finished.

We could potentially reserve the key through a dummy file. But to be honest, even a simple existence check right before the upload would already cut out a lot of the re-uploads, especially for long pipeline runs, and wouldn't even need to track the in-progress upload.
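
As a rough sketch of the dummy-file idea, again in Python with boto3 and placeholder names, a reservation could look like the following. It is best-effort only, since S3 offers no real locking here, and a job that dies after reserving leaves a stale marker unless something (e.g. a bucket lifecycle rule) expires it:

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def try_reserve(bucket: str, key: str, job_id: str) -> bool:
    """Claim the upload by writing a small marker object next to the cache key.

    Best-effort only: two jobs checking at nearly the same time can both
    think they won, and a job that dies after reserving leaves the marker
    behind for others to trip over.
    """
    marker_key = f"{key}.uploading"
    try:
        s3.head_object(Bucket=bucket, Key=marker_key)
        return False  # someone else has (or had) claimed this key
    except ClientError as err:
        if err.response["Error"]["Code"] != "404":
            raise
    s3.put_object(Bucket=bucket, Key=marker_key, Body=job_id.encode())
    return True

def release(bucket: str, key: str) -> None:
    # Remove the marker once the real cache object has been uploaded.
    s3.delete_object(Bucket=bucket, Key=f"{key}.uploading")
```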

— kliakhovskii-brex, Sep 14 '23

> Note: the S3 key is not visible until the object is completely uploaded and the request has finished.
>
> We could potentially reserve the key through a dummy file. But to be honest, even a simple existence check right before the upload would already cut out a lot of the re-uploads, especially for long pipeline runs, and wouldn't even need to track the in-progress upload.

Multipart uploads solve this issue if versioning is enabled in the bucket.

Writing a manifest file (or "dummy file") is usually done after a big upload to confirm you're done (e.g. when uploading thousands of CSV or Parquet files). If you instead use the manifest file to signal "I am uploading" and then die, others will see the manifest and think, "gosh, guess he's still uploading"!

Multipart uploads handle this situation gracefully: everyone can race to write the current version, but no one can ever publish an incomplete version!
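
To make the mechanics concrete, here is a hedged sketch of a multipart upload in Python with boto3 (bucket, key, and path are placeholders). The key only becomes visible once complete_multipart_upload succeeds, so no reader can ever observe a half-written cache, and with versioning enabled each completed upload simply becomes the newest version:

```python
import boto3

s3 = boto3.client("s3")

def save_cache_atomically(bucket: str, key: str, archive_path: str) -> None:
    """Upload the cache archive as a multipart upload.

    Until complete_multipart_upload returns, the key does not exist (or the
    previous version stays current), so concurrent jobs racing on the same
    key can never expose a partially written object.
    """
    upload = s3.create_multipart_upload(Bucket=bucket, Key=key)
    part_size = 64 * 1024 * 1024  # parts must be >= 5 MiB except the last one
    parts = []
    with open(archive_path, "rb") as f:
        part_number = 1
        while chunk := f.read(part_size):
            resp = s3.upload_part(
                Bucket=bucket,
                Key=key,
                PartNumber=part_number,
                UploadId=upload["UploadId"],
                Body=chunk,
            )
            parts.append({"ETag": resp["ETag"], "PartNumber": part_number})
            part_number += 1
    s3.complete_multipart_upload(
        Bucket=bucket,
        Key=key,
        UploadId=upload["UploadId"],
        MultipartUpload={"Parts": parts},
    )
```

In practice both the aws s3 cp CLI and boto3's upload_file already switch to multipart uploads automatically above a size threshold, so the plugin may get this behavior for free for large cache archives.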

— nuzayets, Nov 02 '23