Detect and retry jobs that Kubernetes failed to start
We see weird errors where `kubectl create ...` succeeds but then pod setup/scheduling fails for obscure reasons (google-fluentd running out of memory; filesystems not mounted properly; ...).
If we can detect those cases (i.e., if we know that the job's actual commands never began), we could try to restart them a few times.
- all the jobs could start with a phone-home, or by creating a witness file? (see the sketch after this list)
- if `kubectl get ...` never succeeds, can we assume the job didn't start?
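
For the witness-file idea, a minimal sketch of a wrapper the container entrypoint could run; `WITNESS_DIR` and `JOB_ID` are hypothetical names, assuming some shared volume is mounted into the pod and the job id is injected via the pod spec:

```bash
#!/usr/bin/env bash
# Hypothetical wrapper around the job's real command: create a "witness" file
# as soon as the commands actually begin, so the scheduler can tell
# "never started" apart from "started and then failed".
set -uo pipefail

WITNESS_DIR=${WITNESS_DIR:-/shared/witness}   # assumed shared volume mounted in the pod
JOB_ID=${JOB_ID:?JOB_ID must be set}          # assumed to be injected via the pod spec

mkdir -p "$WITNESS_DIR"
touch "$WITNESS_DIR/$JOB_ID.started"          # witness: the job's commands have begun

# Run the job's actual command line (passed as arguments to this wrapper).
"$@"
status=$?

touch "$WITNESS_DIR/$JOB_ID.finished"         # optional second witness: commands ended
exit "$status"
```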
Just noting that we currently retry 10 times; it would be nice to have some empirical justification for that number.
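
For reference, a rough sketch of what "retry N times" polling could look like against kubectl; the retry count and sleep interval here are placeholders, not values coclobas actually uses:

```bash
#!/usr/bin/env bash
# Hypothetical polling loop: consider the job "started" once `kubectl get pod`
# succeeds; give up after MAX_RETRIES attempts (the 10 mentioned above is a
# guess, not an empirically justified value).
set -uo pipefail

pod=$1
MAX_RETRIES=${MAX_RETRIES:-10}
SLEEP_SECONDS=${SLEEP_SECONDS:-5}

for attempt in $(seq 1 "$MAX_RETRIES"); do
  if kubectl get pod "$pod" -o=json > /dev/null 2>&1; then
    echo "pod $pod visible after $attempt attempt(s)"
    exit 0
  fi
  sleep "$SLEEP_SECONDS"
done

echo "pod $pod never became visible after $MAX_RETRIES attempts; assuming it did not start" >&2
exit 1
```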
Also noting how @ihodes saw this failure manifest:
this is the failure where getting the pod (i.e. checking its status) fails multiple times:
Reason: "Error-from-coclobas: Updating failed 11 times: [ Shell-command
failed:\nCommand:\n```\nkubectl get pod 6608a1dc-cbbc-5fbc-912b-9f97bd38e51b
-o=json\n```\nStatus: Exited with 1\n\nStandard-output: empty.\n\nStandard-error:\n```\nError from
server: pods \"6608a1dc-cbbc-5fbc-912b-9f97bd38e51b\" not found\n```\n ]"
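
One possible heuristic (just a sketch, not what coclobas does today): when `kubectl get pod` keeps exiting 1, check whether stderr is the "not found" error above; if so, the pod arguably never registered and the job could be re-created rather than marked failed:

```bash
# Hypothetical check, assuming $1 is the pod name: distinguish "pod never
# registered" (the 'not found' error) from other, possibly transient, failures.
pod=$1
err=$(kubectl get pod "$pod" -o=json 2>&1 >/dev/null)   # capture stderr only
if [ $? -ne 0 ] && echo "$err" | grep -qi "not found"; then
  echo "pod $pod not found; treating the job as never-started and re-creating it"
  # re-submission (e.g. `kubectl create -f <job-spec>`) would go here
fi
```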
Keep hitting this 😢
File an issue upstream?