edgedb-cli
CLI should retry 'Connection reset by peer' errors returned from the cloud on GET requests
CLI version: 3.2.0-dev (157e024b545531dd1c69b54632676b4be55ed046, master from 2023-06-01)
repro:
run a script that creates & deletes cloud instances over and over again (see https://github.com/edgedb/nebula/issues/496 for an example of the script; it essentially just does instance create via the CLI, sleeps two minutes, deletes the instance, sleeps another two minutes, GOTO 10)
observed behavior:
intermittently, polling for the status of a create / delete operation will fail with an error like:
edgedb error: Could not destroy EdgeDB Cloud instance: HTTP error: error sending request for url (https://api.g.aws-dev-zackelan.edgedb.cloud/v1/operations/27455ca8-08c3-11ee-845c-370cf6b64520): error trying to connect: Connection reset by peer (os error 104)
or
edgedb error: Could not destroy EdgeDB Cloud instance: HTTP error: error sending request for url (https://api.g.aws-dev-zackelan.edgedb.cloud/v1/operations/f1a75360-0987-11ee-a204-e337c4c351ef): error trying to connect: Connection reset by peer (os error 104)
this is caused by the Nebula server being restarted about once an hour to pick up new AWS credentials, and HTTP connections being closed ungracefully when that happens.
desired behavior:
any GET request (but particularly the loop that polls for operation status of create/delete instances) should have retry logic on a connection reset. we also want to improve this behavior on the backend to close connections more gracefully (https://github.com/edgedb/nebula/issues/541), but these errors can happen for a myriad of other reasons and I think retries from the CLI make sense.
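a rough sketch of the kind of retry wrapper I have in mind, not the CLI's actual code. `retry_on_reset` and its parameters are hypothetical names, and `std::io::Error` stands in for whatever error type the HTTP client really returns (mapping the client's error to "connection reset" is assumed to happen elsewhere):

```rust
use std::io;
use std::thread;
use std::time::Duration;

/// Hypothetical helper: runs `op` and retries it when it fails with
/// `ConnectionReset`, making at most `max_attempts` attempts in total.
/// Only meant for idempotent requests such as GET.
pub fn retry_on_reset<T, F>(
    max_attempts: u32,
    base_delay: Duration,
    mut op: F,
) -> io::Result<T>
where
    F: FnMut() -> io::Result<T>,
{
    let mut attempt = 1;
    loop {
        match op() {
            // Retry only on a connection reset, and only while attempts remain.
            Err(e) if e.kind() == io::ErrorKind::ConnectionReset && attempt < max_attempts => {
                // Exponential backoff: base, 2 * base, 4 * base, ...
                thread::sleep(base_delay * 2u32.pow(attempt - 1));
                attempt += 1;
            }
            // Success, a different error, or attempts exhausted: hand it back.
            other => return other,
        }
    }
}
```

the polling loop for operation status would wrap each GET in something like this, so a single reset during a Nebula restart costs one short delay instead of failing the whole create/delete command.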
this is not asking for retries of POST requests (such as the create/delete instance request) even though I've seen connection resets on those as well. retrying them is a thornier problem, because for example at the time the connection is reset, the backend may or may not have enqueued that operation, so a simple retry might or might not have the desired outcome.
Looks like we don't have retries at all yet, so there are more errors worth retrying on, some of them unconditionally, even on POST requests.
Also, I think our API should be structured in a way that makes POST/PUT/DELETE requests idempotent too, if we don't already do that.
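One common way to get there is a client-supplied idempotency key. The following is only a toy illustration of the idea, not how the cloud API works today: `OperationQueue`, `enqueue`, and the key format are all hypothetical, and an in-memory map stands in for the backend's real storage.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for the backend's operation queue.
#[derive(Default)]
pub struct OperationQueue {
    next_id: u64,
    // Maps a client-chosen idempotency key to the operation it created.
    by_key: HashMap<String, u64>,
}

impl OperationQueue {
    /// Enqueues a new operation unless one already exists for `key`.
    /// Either way, returns the id of the operation the client should poll.
    pub fn enqueue(&mut self, key: &str) -> u64 {
        if let Some(&id) = self.by_key.get(key) {
            // A retried request with the same key maps to the same operation.
            return id;
        }
        self.next_id += 1;
        self.by_key.insert(key.to_string(), self.next_id);
        self.next_id
    }
}
```

With something along these lines, a POST whose connection was reset could be retried with the same key: if the first attempt was enqueued, the retry returns the existing operation instead of creating a second one.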