bionic
Bionic should retry GCS uploads/downloads if they time out
Sometimes GCS file uploads (and presumably downloads) time out (stack trace below). For most of these operations we use the GCS Python API rather than gsutil, so timed-out requests are probably not retried by default. We should add retry logic so that a transient failure is less likely to crash the whole process.
  File "/usr/local/lib/python3.7/site-packages/bionic/cache.py", line 284, in _blob_from_file
    self._cloud.upload(file_path, blob_url)
  File "/usr/local/lib/python3.7/site-packages/bionic/cache.py", line 601, in upload
    self._tool.blob_from_url(url).upload_from_filename(str(path))
  File "/usr/local/lib/python3.7/site-packages/google/cloud/storage/blob.py", line 1320, in upload_from_filename
    predefined_acl=predefined_acl,
  File "/usr/local/lib/python3.7/site-packages/google/cloud/storage/blob.py", line 1265, in upload_from_file
    client, file_obj, content_type, size, num_retries, predefined_acl
  File "/usr/local/lib/python3.7/site-packages/google/cloud/storage/blob.py", line 1175, in _do_upload
    client, stream, content_type, size, num_retries, predefined_acl
  File "/usr/local/lib/python3.7/site-packages/google/cloud/storage/blob.py", line 1122, in _do_resumable_upload
    response = upload.transmit_next_chunk(transport)
  File "/usr/local/lib/python3.7/site-packages/google/resumable_media/requests/upload.py", line 425, in transmit_next_chunk
    retry_strategy=self._retry_strategy,
  File "/usr/local/lib/python3.7/site-packages/google/resumable_media/requests/_helpers.py", line 136, in http_request
    return _helpers.wait_and_retry(func, RequestsMixin._get_status_code, retry_strategy)
  File "/usr/local/lib/python3.7/site-packages/google/resumable_media/_helpers.py", line 150, in wait_and_retry
    response = func()
  File "/usr/local/lib/python3.7/site-packages/google/auth/transport/requests.py", line 287, in request
    **kwargs
  File "/usr/local/lib/python3.7/site-packages/google/auth/transport/requests.py", line 110, in __exit__
    raise self._timeout_error_type()
requests.exceptions.Timeout
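For discussion, here is a rough sketch of what the retry logic might look like. It is not Bionic's actual code: the helper name `retry_on_timeout` and its parameters (`retryable`, `max_attempts`, `base_delay`, `sleep`) are made up for illustration, and the attempt count and backoff values are guesses rather than tuned numbers. It defaults to the built-in `TimeoutError` so that it runs without any dependencies; for the failure above we would presumably pass `requests.exceptions.Timeout` instead.

```python
import time


def retry_on_timeout(
    func,
    retryable=(TimeoutError,),
    max_attempts=3,
    base_delay=1.0,
    sleep=time.sleep,
):
    """Call func(), retrying with exponential backoff on retryable errors.

    Hypothetical helper: name, defaults, and backoff policy are assumptions.
    Exceptions not listed in `retryable` propagate immediately.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retryable:
            # Out of attempts: re-raise the last error to the caller.
            if attempt == max_attempts:
                raise
            # Wait base_delay, then 2 * base_delay, then 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))


# Assumed usage inside the upload path from the stack trace (not executed here):
#
#   import requests
#   retry_on_timeout(
#       lambda: blob.upload_from_filename(str(path)),
#       retryable=(requests.exceptions.Timeout,),
#   )
```

Retrying the whole `upload_from_filename` call restarts the upload from scratch, which seems acceptable for a transient timeout but may be wasteful for very large files. Whether other errors should be added to `retryable` is an open question.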
Just noting that we've also seen a google.resumable_media.common.DataCorruption error in the wild; however, I don't know whether a retry would fix it.