Should retry some storage errors.

Open gfr10598 opened this issue 3 years ago • 3 comments

We are currently seeing a low rate of GCS storage errors:

2021/04/13 04:54:19 rowwriter.go:119: googleapi: got HTTP response code 503 with body: Service Unavailable etl-mlab-staging ndt/ndt7/2020/08/27/20200827T170704.505210Z-ndt7-mlab3-lhr05-ndt.tgz.json

These would likely succeed on retry.

gfr10598, Apr 13 '21 16:04

[Image: write failure errors]

After adding a retry with a 2 second delay, we are still seeing the same write errors.

gfr10598, Apr 14 '21 13:04

Looks like there is very little retry happening in the library. If I add a 20 second delay and retry, the initial attempt takes between 0 and 5 seconds, so the library itself is not retrying for long. The Write retries then fail every 20 seconds and never succeed.

There is then a later failed retry, with fewer rows, likely driven by the Flush prior to Close at the end of the archive. Not clear what happened in between. Will investigate further.

2021/04/15 04:06:48 rowwriter.go:122: Retrying after 347234 of 385862 bytes 10m3s googleapi: got HTTP response code 503 with body: Service Unavailable etl-mlab-sandbox/ndt/annotation/2020/11/23/20201123T104213.627630Z-annotation-mlab1-ham02-ndt.tgz.json
2021/04/15 04:07:08 rowwriter.go:122: Retrying after 0 of 965 bytes 20s googleapi: got HTTP response code 503 with body: Service Unavailable etl-mlab-sandbox/ndt/annotation/2020/11/23/20201123T104213.627630Z-annotation-mlab1-ham02-ndt.tgz.json
2021/04/15 04:16:48 task.go:179: Processed 4401 files, 0 nil data, 4359 rows committed, 42 failed, from gs://archive-measurement-lab/ndt/annotation/2020/11/23/20201123T104213.627630Z-annotation-mlab1-ham02-ndt.tgz into annotation

gfr10598, Apr 15 '21 13:04

Tried running many retries, with 20 seconds between them. Across half a dozen failures, none of the retries ever succeeded. The Close also fails.

Checking GCS shows that the corresponding file still exists from a previous parsing, and has not been replaced.

Likely we should abandon the partially written file, probably by cancelling the context that was used to create the object handle?

gfr10598, Apr 16 '21 20:04