Detect and retry jobs that Kubernetes failed to start
We see weird errors where `kubectl create ...` succeeds but then pod setup/scheduling fails for obscure reasons (google-fluentd running out of memory; filesystems not mounted properly; ...).
If we can detect those cases (i.e., if we know that the job's actual commands never began), we could try to restart them a few times.
- all the jobs could start with a phone-home, or by creating a witness file? (see the sketch after this list)
- if `kubectl get ...` never succeeds, can we assume the job didn't start?
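
For the witness-file idea, a minimal sketch of a wrapper the container entrypoint could run; `WITNESS_DIR` and `JOB_ID` are hypothetical names, assuming some shared volume is mounted into the pod and the job id is injected via the pod spec:

```bash
#!/usr/bin/env bash
# Hypothetical wrapper around the job's real command: create a "witness" file
# as soon as the commands actually begin, so the scheduler can tell
# "never started" apart from "started and then failed".
set -uo pipefail

WITNESS_DIR=${WITNESS_DIR:-/shared/witness}   # assumed shared volume mounted in the pod
JOB_ID=${JOB_ID:?JOB_ID must be set}          # assumed to be injected via the pod spec

mkdir -p "$WITNESS_DIR"
touch "$WITNESS_DIR/$JOB_ID.started"          # witness: the job's commands have begun

# Run the job's actual command line (passed as arguments to this wrapper).
"$@"
status=$?

touch "$WITNESS_DIR/$JOB_ID.finished"         # optional second witness: commands ended
exit "$status"
```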
Just noting that we currently retry 10 times; it would be nice to have some empirical justification for that number.
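
For reference, a rough sketch of what "retry N times" polling could look like against kubectl; the retry count and sleep interval here are placeholders, not values coclobas actually uses:

```bash
#!/usr/bin/env bash
# Hypothetical polling loop: consider the job "started" once `kubectl get pod`
# succeeds; give up after MAX_RETRIES attempts (the 10 mentioned above is a
# guess, not an empirically justified value).
set -uo pipefail

pod=$1
MAX_RETRIES=${MAX_RETRIES:-10}
SLEEP_SECONDS=${SLEEP_SECONDS:-5}

for attempt in $(seq 1 "$MAX_RETRIES"); do
  if kubectl get pod "$pod" -o=json > /dev/null 2>&1; then
    echo "pod $pod visible after $attempt attempt(s)"
    exit 0
  fi
  sleep "$SLEEP_SECONDS"
done

echo "pod $pod never became visible after $MAX_RETRIES attempts; assuming it did not start" >&2
exit 1
```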
Also noting how @ihodes saw this failure manifest:
this is the failure where getting the pod (i.e. checking its status) fails multiple times:
Reason: "Error-from-coclobas: Updating failed 11 times: [ Shell-command
failed:\nCommand:\n```\nkubectl get pod 6608a1dc-cbbc-5fbc-912b-9f97bd38e51b
-o=json\n```\nStatus: Exited with 1\n\nStandard-output: empty.\n\nStandard-error:\n```\nError from
server: pods \"6608a1dc-cbbc-5fbc-912b-9f97bd38e51b\" not found\n```\n ]"
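
One possible heuristic (just a sketch, not what coclobas does today): when `kubectl get pod` keeps exiting 1, check whether stderr is the "not found" error above; if so, the pod arguably never registered and the job could be re-created rather than marked failed:

```bash
# Hypothetical check, assuming $1 is the pod name: distinguish "pod never
# registered" (the 'not found' error) from other, possibly transient, failures.
pod=$1
err=$(kubectl get pod "$pod" -o=json 2>&1 >/dev/null)   # capture stderr only
if [ $? -ne 0 ] && echo "$err" | grep -qi "not found"; then
  echo "pod $pod not found; treating the job as never-started and re-creating it"
  # re-submission (e.g. `kubectl create -f <job-spec>`) would go here
fi
```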
Keep hitting this 😢
File an issue upstream?