flow-go

[Block Uploader] Add automatic retries when block uploads fail

Open peterargue opened this issue 2 years ago • 2 comments

Problem Definition

Execution nodes have a feature that allows uploading block execution results to GCP for consumption by DPS. The uploader has support for tracking and retrying failed uploads, but it only checks during startup. That means that if an upload fails for any reason (e.g. GCP returns a 503), the block will be missing until the node reboots.

Since ENs take a long time to start up, rebooting to recover uploads isn't practical.

Proposed Solution

Add immediate automated retries when block uploads fail.

Definition of Done

When a block upload fails, automatically retry it a configurable number of times before giving up. Add an info-level message when a block is successfully uploaded, and an error-level message when it fails.

peterargue avatar Sep 27 '22 17:09 peterargue

The current EN upload call already retries automatically up to 5 times, in addition to the retry feature we recently added in #2743. So the question is whether we should retry the upload again each time those 5 attempts fail.

@peterargue is it common for a block computation result upload to still fail after the default 5 retries? One thing we could also do is write a script that reads the list of blocks that failed to upload from badger and then uploads them manually. However, it should not be very difficult to add a periodic upload retry feature, if that is more convenient. @m4ksio

Tonix517 avatar Sep 27 '22 18:09 Tonix517

Can we make an admin command to trigger that? Or have a background task do it, say, every 10 minutes?

m4ksio avatar Oct 03 '22 16:10 m4ksio

Closing. Upload failures were significantly reduced by fixing error handling in the uploader: https://github.com/onflow/flow-go/pull/3290

peterargue avatar Jan 17 '23 23:01 peterargue