gitlab-runner
Fail cache downloads when required, retry on memory alloc fail
Issue
Cache downloads can fail when containers are under memory pressure. This happens somewhat regularly with our jobs, particularly when the cache ZIP contains a lot of files, such as a node_modules cache.
The problem is that an unzip failure does NOT fail the job, yet the restored cache is effectively corrupt: files will be missing or truncated.
My theory is that this is due to GC pressure. The extract code unzips all of the files, iterating the archive in a loop. Inspecting the Go `flate` library, there does not appear to be pooling between the decompressors, i.e. they each allocate their own memory. It looks like they try to use fairly small buffers, but it's not clear how much they might hold at a given time. In any case, with an archive that may have 100K files, this loop will create decompressors quickly and the GC may fall behind.
So this fix:
- Detects memory errors
- Retries memory allocation errors 3 times
- Waits 1s and triggers a GC on memory error
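
For readers skimming the description, here is a minimal sketch of the retry behavior listed above. It is not the code in this PR: `isMemoryError`, `retryOnMemoryError`, and `errSimulated` are hypothetical names, and the sketch assumes the failure surfaces as an error wrapping `syscall.ENOMEM` ("cannot allocate memory").

```go
package main

import (
	"errors"
	"fmt"
	"runtime"
	"syscall"
	"time"
)

// errSimulated stands in for a real extraction failure (assumption: the real
// error wraps ENOMEM, as an *os.PathError from a failed write would).
var errSimulated = fmt.Errorf("write file: %w", syscall.ENOMEM)

// isMemoryError reports whether err wraps ENOMEM.
func isMemoryError(err error) bool {
	return errors.Is(err, syscall.ENOMEM)
}

// retryOnMemoryError runs op up to attempts times. After a memory error it
// triggers a GC and waits before the next attempt. Any other error, or
// success, is returned immediately.
func retryOnMemoryError(attempts int, wait time.Duration, op func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		err = op()
		if err == nil || !isMemoryError(err) {
			return err
		}
		if i < attempts-1 {
			runtime.GC()
			time.Sleep(wait)
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	// The description uses a 1s wait; a shorter one keeps this demo quick.
	err := retryOnMemoryError(3, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errSimulated // fail twice, then succeed
		}
		return nil
	})
	fmt.Println(calls, err)
}
```

Non-memory errors are returned on the first attempt, so only allocation failures pay the GC-and-wait cost.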
Separately, it introduces a `required` field on `cache` so that failures to unzip the cache will fail the job immediately. This seems like the ideal behavior (a broken cache is going to be a broken build), but to make this non-breaking this field is introduced as a `boolean`.
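
A rough sketch of how such a flag could gate the failure, for illustration only. `CacheConfig`, `handleExtractError`, and `errExtract` are hypothetical names and do not reflect the runner's actual types.

```go
package main

import (
	"errors"
	"fmt"
)

// CacheConfig is a hypothetical stand-in for the cache settings; only the
// boolean "required" flag from the description is modelled.
type CacheConfig struct {
	Required bool
}

// errExtract simulates an extraction failure.
var errExtract = errors.New("cannot allocate memory")

// handleExtractError decides what an extraction failure means for the job:
// a required cache turns it into a job-failing error, otherwise it is only
// logged as a warning and the job continues.
func handleExtractError(cfg CacheConfig, err error) error {
	if err == nil {
		return nil
	}
	if cfg.Required {
		return fmt.Errorf("required cache could not be extracted: %w", err)
	}
	fmt.Println("WARNING:", err)
	return nil
}

func main() {
	fmt.Println(handleExtractError(CacheConfig{Required: false}, errExtract))
	fmt.Println(handleExtractError(CacheConfig{Required: true}, errExtract))
}
```

With the flag unset the behavior matches today's warning-only handling, which is what keeps the change non-breaking.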
Symptom
Downloading cache.zip from https://s3.dualstack.us-west-2.amazonaws.com/ci-gitlab.foo.com/project/3304/ui-cache
WARNING: ui/node_modules/date-fns/locale/ta/_lib/match/index.js: write ui/node_modules/date-fns/locale/ta/_lib/match/index.js: cannot allocate memory (suppressing repeats)
WARNING: logn-ui/node_modules/date-fns/locale/ta/index.d.ts: read ../../../../../../cache/...-monorepo/cache.zip: cannot allocate memory (suppressing repeats)
/scripts-dafas-2051793280/restore_cache: line 203: 113 Killed '/usr/bin/gitlab-runner-helper' "cache-extractor"
These failures are intermittent, and in many cases retrying the job succeeds.
Added tests, etc.
Might be worth splitting into 2 PRs, one for the extractor, one for the command / cache changes.
@shawnburke Apologies, this repository is a mirror of https://gitlab.com/gitlab-org/gitlab-runner where the development happens. Would you mind opening a merge request there?
Closing this PR, would be great to see you on gitlab.com...