
Stop all services if core settings upgrade fails

Open sisuresh opened this issue 9 months ago • 5 comments

Resolves https://github.com/stellar/quickstart/issues/547

An example of the failure that this can result in can be seen here - https://github.com/sisuresh/docker-stellar-core-horizon/pull/1.

sisuresh avatar Apr 26 '24 23:04 sisuresh

We have this issue on a larger scale: when things go wrong in this image, it rarely stops running. Sometimes this is okay, as it would actually be more disruptive to stop everything because a single service that a dev might not even be using failed to start. But routinely these quiet failures cause us to miss a failure until some later point, where we end up debugging the problem indirectly.

I'm saying this because I spent yesterday debugging a failure in another part of the start script where it just kept on going, with a small error in the logs that was not easy to spot.

Instead of the change here, I'm wondering if we should set some different failure configs so that if anything goes wrong we hard exit the script, and so at the location of this check we'd do that instead of stopping services.

The change in this PR makes the GitHub Actions run fail when this happens, so we'll never publish an image that is broken in this way. Hard exiting would be ideal, but I'm not sure how to do that. If that's possible, then we should go that route.
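For reference, one common way to get the hard-exit behaviour discussed above is bash strict mode. The following is only a minimal sketch under stated assumptions: the inner script is a hypothetical stand-in, not the quickstart start script, and the variable names are made up for illustration.

```shell
# Write a tiny stand-in start script (hypothetical; not the quickstart script).
demo_script="$(mktemp)"
cat > "$demo_script" <<'EOF'
#!/usr/bin/env bash
# -e: exit on the first failing command
# -u: treat unset variables as errors
# -o pipefail: a pipeline fails if any stage fails
set -euo pipefail

echo "step 1 ok"
false              # simulated failing step
echo "step 2 ok"   # never reached
EOF

# Run it as a child process and record how it ended.
strict_status=0
strict_output="$(bash "$demo_script")" || strict_status=$?
rm -f "$demo_script"

echo "exit status: $strict_status"
echo "output: $strict_output"
```

The child script stops after the first step and returns a non-zero status instead of carrying on. Note that `set -e` does not fire for commands tested by `if`, `&&`, or `||`, so it would not replace explicit checks everywhere.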

sisuresh avatar Apr 30 '24 16:04 sisuresh

Looks like there are still situations where the upgrade can fail, such as if the file needs updating due to an xdr change, and the image can keep running.

leighmcculloch avatar May 01 '24 13:05 leighmcculloch

Looks like there are still situations where the upgrade can fail, such as if the file needs updating due to an xdr change, and the image can keep running.

Argh yeah good point. This check requires the previous steps to pass. A "health" check at the end where we validate that the upgrade went through would be better. The fact that the errors are just swallowed by the service is annoying though.
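A rough sketch of what such an end-of-startup check could look like is below. Everything in it is an assumption for illustration: `upgrade_applied` is a hypothetical placeholder (here it only tests for a marker file) standing in for whatever query would actually confirm the settings upgrade, and `verify_upgrade` is a made-up helper name.

```shell
# Hypothetical check: stands in for whatever query confirms the upgrade landed.
# For illustration it succeeds once a marker file exists.
upgrade_applied() {
  [ -f "$marker" ]
}

# Poll the check up to max_attempts times; return non-zero if it never passes.
# A real check would likely sleep between attempts.
verify_upgrade() {
  max_attempts=$1
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if upgrade_applied; then
      return 0
    fi
    attempt=$((attempt + 1))
  done
  return 1
}

workdir="$(mktemp -d)"
marker="$workdir/upgrade-done"

# First run: the marker is missing, so the check reports failure.
health_missing=0
verify_upgrade 3 || health_missing=$?

# Second run: the marker exists, so the check passes.
touch "$marker"
health_present=0
verify_upgrade 3 || health_present=$?

rm -rf "$workdir"
echo "without marker: $health_missing, with marker: $health_present"
```

Because the check runs after all earlier steps, it reports a failed upgrade even when the step that caused it swallowed its own error.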

sisuresh avatar May 01 '24 16:05 sisuresh

I'm looking into doing something like https://serverfault.com/a/922943.

sisuresh avatar May 01 '24 16:05 sisuresh

@leighmcculloch was the error you ran into in the start script or within a service? I made a change to just trap on ERR and kill supervisor, but it'll only trap in the upgrade_local function for now.
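A minimal sketch of the trap-on-ERR approach described here, not the actual change in this PR. The body of `upgrade_local` and the `stop_supervisor` helper are hypothetical stand-ins; in a real image the helper might run something like `supervisorctl shutdown`, which is an assumption rather than what the PR does.

```shell
# Write a stand-in start script (hypothetical; not the quickstart script).
trap_script="$(mktemp)"
cat > "$trap_script" <<'EOF'
#!/usr/bin/env bash

# Hypothetical helper: a real image might call `supervisorctl shutdown` here.
stop_supervisor() {
  echo "stopping supervisor"
}

upgrade_local() {
  # On the first failing command in this function, stop services and exit.
  trap 'stop_supervisor; exit 1' ERR
  echo "applying upgrade"
  false                    # simulated upgrade failure
  echo "upgrade applied"   # never reached
}

upgrade_local
echo "start script continues"   # never reached
EOF

# Run it as a child process and record how it ended.
trap_status=0
trap_output="$(bash "$trap_script")" || trap_status=$?
rm -f "$trap_script"

echo "exit status: $trap_status"
echo "$trap_output"
```

Since the ERR trap is not inherited by functions that `upgrade_local` calls unless `set -E` (errtrace) is enabled, a failure inside a nested helper function would not be caught by this sketch.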

sisuresh avatar May 01 '24 18:05 sisuresh

This pull request is stale because it has been open for 30 days with no activity. It will be closed in 30 days unless the stale label is removed.

github-actions[bot] avatar Jun 01 '24 18:06 github-actions[bot]

@leighmcculloch have you seen an action error like the one currently failing in this PR before? It looks like the build-stellar-core step under Testing just times out after 6 hours.

sisuresh avatar Jun 13 '24 15:06 sisuresh

I've seen it a couple times, intermittent, not consistent. On other PRs rerunning the build resulted in a pass. I kicked it again.

leighmcculloch avatar Jun 13 '24 23:06 leighmcculloch