Retry transient GCS errors
https://github.com/mozilla/bedrock/actions/runs/4598182706/jobs/8121878736 https://github.com/mozilla/glean/actions/runs/4609412250/jobs/8146505104?pr=2441
gsutil is intermittently failing to download objects, reporting 404 errors:
Error: Command ['gsutil', '-q', '-m', 'rsync', '-r', 'gs://probe-scraper-prod-artifacts/glean/', '/tmp/tmpy4y4leon/output/glean'] returned non-zero exit status 1:
NotFoundException: 404 gs://probe-scraper-prod-artifacts/glean/reference-browser/general does not exist.
NotFoundException: 404 gs://probe-scraper-prod-artifacts/glean/reference-browser/pings does not exist.
NotFoundException: 404 gs://probe-scraper-prod-artifacts/glean/reference-browser/tags does not exist.
CommandException: 3 files/objects could not be copied/removed.
the error is transient, because the objects do exist, ~but presumably are temporarily disappearing during upload or something like that.~ edit: they have been updated since gsutil listed them, and gsutil requests the specific version it saw at the time of listing.
we could retry the full gsutil sync on failure, or we could reimplement the gsutil sync in python and retry 404s. the latter option is probably more robust, and should be relatively short.
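As a rough illustration of the first option, here is a minimal sketch of retrying the whole command on a non-zero exit. The helper name `run_with_retries` and its `attempts`, `delay`, `run`, and `sleep` parameters are hypothetical, not existing code in this repository; the `run` and `sleep` hooks exist only so the logic can be exercised without calling gsutil.

```python
import subprocess
import time


def run_with_retries(cmd, attempts=3, delay=5.0, run=subprocess.run, sleep=time.sleep):
    """Run cmd, retrying the whole command whenever it exits non-zero.

    Hypothetical helper: it retries on any failure, not only on 404s,
    because gsutil reports per-object errors through a single exit status.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        result = run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result
        if attempt < attempts:
            # Linear backoff, assuming a concurrent upload finishes shortly.
            sleep(delay * attempt)
    # All attempts failed: surface the last failure to the caller.
    raise subprocess.CalledProcessError(
        result.returncode, cmd, output=result.stdout, stderr=result.stderr
    )
```

It would be called with the same argument list as today, e.g. `run_with_retries(['gsutil', '-q', '-m', 'rsync', '-r', src, dst])`. Since rsync only copies what is missing or changed, a second pass should mostly re-list and fetch the objects that failed.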
@relud should we consider adding some logging to help understand the issue first, e.g. what's described in GoogleCloudPlatform/gsutil#906?
we could add the -DD flag:
OPTIONS
-D Shows HTTP requests/headers and additional debug info needed
when posting support requests, including exception stack traces.
CAUTION: The output from using this flag includes authentication
credentials. Before including this flag in your command, be sure
you understand how the command's output is used, and, if
necessary, remove or redact sensitive information.
-DD Same as -D, plus HTTP upstream payload.
but I wouldn't recommend it, as those headers will include auth tokens.
That said, I can confirm from running the command locally with -DD that gsutil does request a specific "generation" of objects, so if the file was rewritten between listing the object and downloading the content, I would expect it to 404.
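To make the suspected race concrete, here is a toy model of it, plus the "retry 404s" idea from the original post. Everything in it is an assumption for illustration: `FakeBucket`, `NotFound`, and `download_all` are hypothetical in-memory stand-ins, not the GCS client API, and the model assumes an overwrite makes the previous generation unavailable.

```python
class NotFound(Exception):
    """Stands in for a 404 response in this toy model."""


class FakeBucket:
    """In-memory stand-in for a bucket; not the real GCS client API."""

    def __init__(self):
        self._objects = {}  # name -> (generation, data)
        self._next_generation = 1

    def write(self, name, data):
        # Assumption: an overwrite creates a new generation and drops the old one.
        self._objects[name] = (self._next_generation, data)
        self._next_generation += 1

    def list(self):
        return [(name, gen) for name, (gen, _) in self._objects.items()]

    def current_generation(self, name):
        if name not in self._objects:
            raise NotFound(name)
        return self._objects[name][0]

    def read(self, name, generation):
        current = self._objects.get(name)
        if current is None or current[0] != generation:
            raise NotFound(f"{name}#{generation}")
        return current[1]


def download_all(bucket, listing, attempts=3):
    """Download each listed object, re-resolving the generation after a 404."""
    result = {}
    for name, generation in listing:
        for attempt in range(attempts):
            try:
                result[name] = bucket.read(name, generation)
                break
            except NotFound:
                if attempt == attempts - 1:
                    raise
                # The object was probably rewritten after listing; ask again.
                generation = bucket.current_generation(name)
    return result
```

In this model, a write that lands between `list()` and `read()` makes the pinned read fail, while `download_all` recovers by looking up the current generation before retrying.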