gitlab-runner icon indicating copy to clipboard operation
gitlab-runner copied to clipboard

Fail cache downloads when required, retry on memory alloc fail

Open shawnburke opened this issue 3 years ago • 1 comments

Issue

Cache downloads can fail when containers have memory pressure. This happens somewhat regularly with our jobs, particularly when the cache has a lot of files in the ZIP, such as a node_modules cache.

The problem is that these do NOT fail the job on the cache unzip failure but the cache is effectively corrupt: files will be missing or truncated.

My theory is that this is due to GC pressure. The extract code unzips all of the files, iterating the archive in a loop. Inspecting the Golang flate library, there does not appear to be pooling between the decompressors, eg. they each allocate their own memory. It looks like they try to use fairly small buffers, but its not clear how much they might store at a given time. In any case, with an archive that may have 100K files, this loop will create these things quickly and GC may get behind.

So this fix:

  1. Detects memory errors
  2. Retries memory allocation errors 3 times
  3. Waits 1s and triggers a GC on memory error

Separately, it introduces a required field on cache so that failures to unzip the cache will fail the job immediately. This seems like the ideal behavior (a broken cache is going to be a broken build), but to make this non-breaking this field is introduced as a boolean.

Symptom

Downloading cache.zip from https://s3.dualstack.us-west-2.amazonaws.com/ci-gitlab.foo.com/project/3304/ui-cache 
WARNING: ui/node_modules/date-fns/locale/ta/_lib/match/index.js: write ui/node_modules/date-fns/locale/ta/_lib/match/index.js: cannot allocate memory (suppressing repeats) 
WARNING: logn-ui/node_modules/date-fns/locale/ta/index.d.ts: read ../../../../../../cache/...-monorepo/cache.zip: cannot allocate memory (suppressing repeats) 
/scripts-dafas-2051793280/restore_cache: line 203:   113 Killed                  '/usr/bin/gitlab-runner-helper' "cache-extractor" 

These are intermittent and will succeed on retrying the job in many cases.

Added tests, etc.

Might be worth splitting into 2 PRs, one for the extractor, one for the command / cache changes.

shawnburke avatar Feb 04 '22 01:02 shawnburke

@shawnburke Apologies, this repository is a mirror of https://gitlab.com/gitlab-org/gitlab-runner where the development happens. Would you mind opening a merge request there?

Reprazent avatar Mar 04 '22 19:03 Reprazent

  • [ ] ~~********~~

Valentinodelicious avatar Jun 11 '23 17:06 Valentinodelicious

Closing this PR, would be greate to see you on gitlab.com...

StingRayZA avatar Oct 23 '23 09:10 StingRayZA