Artifact upload retries are far too slow
Recently (over roughly the past month) we have started seeing very inconsistent artifact upload times: an upload that normally takes ~5s can end up taking almost 17m, which is approaching double our average build time. For example, here is the output from one of these very long uploads:
...
2021-09-20 07:52:55 DEBUG Waiting for uploads to complete...
2021-09-20 07:52:55 INFO Uploading artifact <artifact id> (25372521 bytes)
2021-09-20 07:52:55 DEBUG POST https://s3.amazonaws.com/buildkiteartifacts.com
2021-09-20 08:09:38 WARN Post "https://s3.amazonaws.com/buildkiteartifacts.com": write tcp 172.31.8.38:54986->52.216.95.133:443: write: connection timed out (Attempt 1/10 Retrying in 5s)
2021-09-20 08:09:43 DEBUG POST https://s3.amazonaws.com/buildkiteartifacts.com
2021-09-20 08:09:45 INFO Successfully uploaded artifact "linux-client.tar.gz"
2021-09-20 08:09:45 DEBUG Uploads complete, waiting for upload status to be sent to buildkite...
2021-09-20 08:09:46 DEBUG Artifact `1d417c03-0eee-45c5-b17e-cbda357caded` has state `finished`
...
I assume it is doing some kind of exponential backoff, given how long it takes, but only one warning is ever printed, for the first retry.
This massive spike in wall time prompted me to add a small workaround to our build script: kill the buildkite-agent artifact upload process if it runs past a timeout (AFAICT there is no way to specify a timeout for the upload itself), then retry it. This works just fine... until it comes to downloading the artifacts later. Even though the first upload was killed, and its artifact is marked "state": "new" rather than "state": "finished" in the REST API result, buildkite-agent artifact download fails with:
2021-09-20 13:27:52 FATAL Failed to download artifacts: GET https://agent.buildkite.com/v3/builds/<build-id>/artifacts/search?query=linux-client.tar.gz: 400 Multiple artifacts were found for query: `linux-client.tar.gz`. Try scoping by the job ID or name.
That breaks the retry workaround, so I've added a workaround for the workaround: a metadata tag that records the actual query to use. Long story short, this feels like something buildkite-agent itself could handle more gracefully.
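For anyone wanting to reproduce the first workaround (kill the upload after a timeout, then retry), here is a minimal Python sketch. It is not our actual build script: the function name run_with_timeout and the timeout and attempt values are made up for illustration. It covers only the kill-and-retry part, so it still leaves the stale "new" artifact behind on a killed attempt.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s, attempts):
    """Run cmd, killing it and retrying if it runs longer than timeout_s.

    Hypothetical helper, not part of buildkite-agent.
    Returns True as soon as one attempt exits with status 0, else False.
    """
    for attempt in range(1, attempts + 1):
        try:
            # subprocess.run kills the child process when the timeout
            # expires, then raises TimeoutExpired.
            result = subprocess.run(cmd, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            print(
                f"attempt {attempt}/{attempts} timed out after {timeout_s}s",
                file=sys.stderr,
            )
            continue
        if result.returncode == 0:
            return True
        print(
            f"attempt {attempt}/{attempts} exited with {result.returncode}",
            file=sys.stderr,
        )
    return False
```

In a build script this might be called as run_with_timeout(["buildkite-agent", "artifact", "upload", "linux-client.tar.gz"], timeout_s=60, attempts=3), with the 60s and 3 attempts being arbitrary example values.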