go-spacemesh icon indicating copy to clipboard operation
go-spacemesh copied to clipboard

checkpoint: improve HTTP error handling

Open ivan4th opened this issue 1 year ago • 3 comments

Motivation

When recovery.recovery-uri is configured and an error happens, such as network glitch or a temporary server error due to overload, go-spacemesh continues its execution without recovery, which may not be always intended behavior. This, for example, causes TestCheckpoint systest flakes sometimes. The way to fix it is to retry a few times upon transient errors (such as network glitch or a 503 server error), terminating after all the retries are exhausted, continue without an error upon 404 Not Found, and terminate go-spacemesh in other cases.

Description

When recovering from an URL, only continue without failing if checkpoint data request fails with 404 code. If the request fails with other 4xx HTTP error code, go-spacemesh startup fails. Upon other errors, including 5xx HTTP errors and network errors, retry checkpoint data request several times.recovery.retry-max and recovery.retry-delay specify retry parameters, defaulting to 5 and 3s, respectively. go-spacemesh terminates if all the retries fail.

Test Plan

Systests should pass

ivan4th avatar Jul 11 '24 21:07 ivan4th

Codecov Report

Attention: Patch coverage is 55.00000% with 18 lines in your changes missing coverage. Please review.

Project coverage is 81.8%. Comparing base (e70a108) to head (afcac72). Report is 25 commits behind head on develop.

Files Patch % Lines
node/node.go 0.0% 10 Missing :warning:
checkpoint/recovery.go 60.0% 5 Missing and 1 partial :warning:
checkpoint/util.go 86.6% 2 Missing :warning:
Additional details and impacted files
@@           Coverage Diff           @@
##           develop   #6130   +/-   ##
=======================================
  Coverage     81.7%   81.8%           
=======================================
  Files          302     301    -1     
  Lines        32637   32620   -17     
=======================================
+ Hits         26695   26710   +15     
+ Misses        4204    4181   -23     
+ Partials      1738    1729    -9     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Jul 11 '24 22:07 codecov[bot]

I don't understand why we want to default to ignoring failing requests for checkpoints in system tests? Especially in the tests that do a checkpoint and restore these should be caught and logged and cause the test to fail early (failing to fetch the checkpoint) instead of having the restore "succeed" and then fail at assertion.

I think instead we need to make sure that the pods serving the checkpoints are up and available before starting the smeshing nodes. Wdys?

The problem is more complicated than that, I explained it in the commit message but perhaps should comment in the code:

This is needed b/c in the context of TestRecovery systest, all the nodes are created with --recovery-uri flag passed, including the bootnodes. This way, they can't successfully start unless the bootstrapper is already running. But the bootstrapper requires an working go-spacemesh node to start up, thus we get a circulardependency.

This circular dependency is rather hard to resolve, it involves avoiding bootstrapper restart but that's not easily doable b/c it will involve some substantial restructuring of systest code

ivan4th avatar Jul 24 '24 21:07 ivan4th

Added a comment explaining why go-spacemesh needs to continue after unsuccessfully retrying checkpoint recovery several times. We may want to fix this unfortunate circular dependency of node-bootstrapper startup in a followup.

ivan4th avatar Jul 25 '24 00:07 ivan4th