Prevent parallel jobs from overwriting the same S3 object when saving the cache
Suppose we have N jobs running at the same time with the same cache key, and no cache has been saved yet. The job that finishes last will overwrite the cache saved by the ones before it.
job_a [0% -----------------------100%] -> cache (will overwrite cache saved from b and c)
job_b [0% ------------100%] -> cache (will overwrite cache saved from c)
job_c [0% -------- 100%] -> cache
It looks like the plugin only checks for the S3 object in restore(), not in cache().
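For illustration, here is a minimal sketch of what that check could look like on the save path, using boto3. The bucket name, cache key, and archive path are hypothetical, and this is not the plugin's actual code:

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
bucket = "my-ci-cache-bucket"                 # hypothetical bucket
cache_key = "v1-node-modules-<hash>.tar.gz"   # hypothetical cache key
archive = "/tmp/node_modules.tar.gz"          # hypothetical local archive

def cache_exists(bucket: str, key: str) -> bool:
    """Return True if an object is already stored under this cache key."""
    try:
        s3.head_object(Bucket=bucket, Key=key)
        return True
    except ClientError as err:
        # A missing key surfaces as a 404/NotFound; anything else is a real error.
        if err.response["Error"]["Code"] in ("404", "NoSuchKey", "NotFound"):
            return False
        raise

# Save path: skip the slow upload when another job already saved this key.
if cache_exists(bucket, cache_key):
    print(f"cache already exists for {cache_key}, skipping upload")
else:
    s3.upload_file(archive, bucket, cache_key)
```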
Just to follow up on this issue: the problem is that the cache gets re-written multiple times even though it is already there.
A simple use case is a node_modules cache for a large web project. These caches are usually heavy (e.g. 500 MB+) and required for every check in the project (e.g. test / lint / prettier). When the key changes, every job misses the old cache -> installs the dependencies from scratch -> uploads the cache. And the upload step is usually much slower than the download, so it adds 3-5 minutes to every job in a PR even though the cache was already saved by the fastest job.
Ideally it should check whether a cache under the key already exists before saving. In the same situation, the GHA implementation of caches saves the cache from the fastest job and then bails out on the others with this error:
Unable to reserve cache with key ${key}, another job may be creating this cache
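If the bucket and SDK support S3 conditional writes (the `If-None-Match` precondition on PutObject), a similar "reserve or bail out" behavior could be sketched like this. This is an assumption-laden sketch with hypothetical names, not a tested change to the plugin:

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
bucket = "my-ci-cache-bucket"           # hypothetical bucket
key = "v1-node-modules-<hash>.tar.gz"   # hypothetical cache key

try:
    with open("/tmp/node_modules.tar.gz", "rb") as body:
        # The PUT fails with 412 if the key already exists, so only one of
        # the racing jobs actually stores the archive.
        s3.put_object(Bucket=bucket, Key=key, Body=body, IfNoneMatch="*")
except ClientError as err:
    # 412 = key already written; 409 = another conditional write in flight.
    if err.response["Error"]["Code"] in ("PreconditionFailed", "ConditionalRequestConflict"):
        print(f"Unable to reserve cache with key {key}, another job may be creating this cache")
    else:
        raise
```

The downside compared to GHA's reserve-then-upload is that the losing job only finds out after it has already streamed the whole archive, so this avoids the overwrite but not the wasted upload time.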
I'm looking into possible solutions for this. I haven't started on the code yet, but I'm considering a few options.
Note: the S3 key is not visible until the object has been completely uploaded and the request has finished.
Potentially we could reserve it through a dummy file. But to be honest, even a double check right before the upload would already cut out a lot of the re-uploads, especially for the longer steps in pipelines. We might not even need to track the in-flight upload for that.
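For the record, the dummy-file idea would look roughly like this (hypothetical names again). Note that the check-then-put is itself still racy, and a crashed job leaves the marker behind, which is exactly the concern raised in the next comment:

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
bucket = "my-ci-cache-bucket"           # hypothetical bucket
key = "v1-node-modules-<hash>.tar.gz"   # hypothetical cache key
marker = key + ".uploading"             # hypothetical "dummy file" marker

def try_reserve() -> bool:
    """Claim the key by dropping a tiny marker object; back off if it exists."""
    try:
        s3.head_object(Bucket=bucket, Key=marker)
        return False                    # someone else claimed the key
    except ClientError:
        s3.put_object(Bucket=bucket, Key=marker, Body=b"")
        return True

if try_reserve():
    s3.upload_file("/tmp/node_modules.tar.gz", bucket, key)
    s3.delete_object(Bucket=bucket, Key=marker)   # done, drop the marker
```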
Multipart uploads solve this issue if versioning is enabled in the bucket.
Writing a manifest file (or "dummy file") is usually done after a big upload to confirm you're done (e.g. when uploading thousands of CSVs or Parquet files). If you use the manifest file to signal "I am uploading" and your job dies, others will see the manifest and think, "gosh, guess he's still uploading!"
Multipart uploads handle this situation gracefully: everyone races to write the current version, but no one can ever write an incomplete version!
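For completeness, the multipart flow in boto3 looks roughly like this (hypothetical names). The object under the cache key only appears, atomically, when complete_multipart_upload succeeds, so a job that dies mid-upload never leaves a half-written archive, and with versioning enabled, concurrent completions just add versions instead of clobbering each other's bytes:

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-ci-cache-bucket"           # hypothetical bucket
key = "v1-node-modules-<hash>.tar.gz"   # hypothetical cache key

# Start the upload: nothing is visible under `key` yet.
mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)

parts, part_number = [], 1
with open("/tmp/node_modules.tar.gz", "rb") as f:
    while chunk := f.read(64 * 1024 * 1024):      # 64 MiB parts
        resp = s3.upload_part(
            Bucket=bucket, Key=key, UploadId=mpu["UploadId"],
            PartNumber=part_number, Body=chunk,
        )
        parts.append({"PartNumber": part_number, "ETag": resp["ETag"]})
        part_number += 1

# The object materializes only here, as a single atomic operation.
s3.complete_multipart_upload(
    Bucket=bucket, Key=key, UploadId=mpu["UploadId"],
    MultipartUpload={"Parts": parts},
)
```

(boto3's higher-level upload_file already switches to multipart transfers for large files, so a plugin using it likely gets this behavior for free.)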