Make autodeploys more reliable, or at least more explanatory when they fail

Open: celskeggs opened this issue 6 years ago • 1 comment

We're currently hitting intermittent failures like the one below: https://hijinks.mit.edu/jenkins/job/homeworld/job/temp/52/console

This is probably related to something that either prevents the apiserver from starting or prevents something in Prometheus from working correctly. However, I have been unable to replicate it in a toastfs-dev autodeploy environment.

It would be nice to build some better tools for automatically explaining what happened when an autodeploy fails, so that we can figure out what's going on in these edge cases. We should also fix this particular autodeploy problem so that our Jenkins builds are more reliable.

celskeggs, Oct 24 '19 16:10

It's intermittent, so I'm not surprised you couldn't replicate it on toastfs-dev. It's probably just a timeout issue. The best I can think of is to have Jenkins make a copy of the disks somewhere whenever a failure occurs, so that we can debug it manually afterwards.
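
A rough sketch of the copy-on-failure idea, in case it helps. Everything here is an assumption rather than how the autodeploy is actually wired up: `run_and_preserve` is a hypothetical helper, and it assumes the disk images are plain files in a single directory that the Jenkins job can read.

```python
import shutil
import subprocess
from pathlib import Path


def run_and_preserve(cmd, disk_dir, archive_root, label):
    """Run cmd; if it fails, copy the files in disk_dir to archive_root/label.

    Hypothetical helper: the command, directory layout, and label scheme
    are placeholders, not the real autodeploy configuration.
    Returns (exit code, archive path or None).
    """
    result = subprocess.run(cmd)
    if result.returncode == 0:
        # Success: nothing to preserve.
        return 0, None

    # Failure: keep a copy of every disk image for manual debugging later.
    dest = Path(archive_root) / label
    dest.mkdir(parents=True, exist_ok=True)
    for image in sorted(Path(disk_dir).iterdir()):
        if image.is_file():
            shutil.copy2(image, dest / image.name)
    return result.returncode, dest
```

The label could be something like the Jenkins build number, so that each failed build gets its own directory. Disk images can be large, so the archive location would need some cleanup policy.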

krawthekrow, Oct 24 '19 18:10