If a job fails very quickly, we never get any logs
We already try to save the logs when a job dies: https://github.com/hammerlab/coclobas/blob/master/src/lib/server.ml#L227
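For reference, a minimal sketch of that kind of log-saving step, assuming we just shell out to kubectl and persist its output as JSON (the helper names and path layout here are hypothetical, not Coclobas's actual API; the field names mirror the saved-command JSON files under _cocloroot/logs/):

(* Minimal sketch, not Coclobas's actual implementation: run
   `kubectl logs <pod-id>`, capture its stdout, and write the result
   as a JSON record in the job's log directory.  stderr capture and
   proper JSON escaping are omitted for brevity (%S uses OCaml string
   escaping, which is close enough for a sketch). *)
let read_all ic =
  let buf = Buffer.create 4096 in
  (try
     while true do
       Buffer.add_channel buf ic 1
     done
   with End_of_file -> ());
  Buffer.contents buf

let save_kubectl_logs ~logs_dir ~pod_id =
  let cmd = Printf.sprintf "kubectl logs %s" pod_id in
  let ic = Unix.open_process_in cmd in
  let out = read_all ic in
  let status =
    match Unix.close_process_in ic with
    | Unix.WEXITED n   -> Printf.sprintf "[ \"Exited\", %d ]" n
    | Unix.WSIGNALED n -> Printf.sprintf "[ \"Signaled\", %d ]" n
    | Unix.WSTOPPED n  -> Printf.sprintf "[ \"Stopped\", %d ]" n
  in
  let json =
    Printf.sprintf
      "{ \"command\": { \"command\": %S, \"stdout\": %S, \"status\": %s, \"exn\": null } }"
      cmd out status
  in
  let oc = open_out (Filename.concat logs_dir (pod_id ^ ".json")) in
  output_string oc json;
  close_out oc

The catch, as the dump below shows, is that when the pod dies fast enough the kubectl call itself can fail, so the saved file only records kubectl's error rather than the job's output.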
What happened in your case?
Can you still run kubectl describe on it, for example?
Describe works; the job runs in the Docker container but dies immediately (bad CLI args in my shell script) and exits.
opam@e2b78e43fa00:/coclo/_cocloroot/logs/logs/job/522c556b-b975-567e-b254-02d4beadc9ca/commands$ cat 1481229903345_3c1b3504.json
{
  "command": {
    "command": "kubectl logs 522c556b-b975-567e-b254-02d4beadc9ca",
    "stdout": "",
    "stderr":
      "Error from server: Get https://gke-ihodes-coco3-cluster-default-pool-36378887-pskd:10250/containerLogs/default/522c556b-b975-567e-b254-02d4beadc9ca/522c556b-b975-567e-b254-02d4beadc9cacontainer: No SSH tunnels currently open. Were the targets able to accept an ssh-key for user \"gke-e170239faa5e49b2ac95\"?\n",
    "status": [ "Exited", 1 ],
    "exn": null
  }
}
This may be due to the Google Cloud metadata limitation; we run out of room at 32 KB or something absurd (project-wide).
I also had similar issues where the describe log showed a successful resource allocation and job start, yet the job failed without any Kubernetes log. For example, passing an invalid URL to wget (that is, a poorly constructed URL for --tumor, --rna, or --normal) makes those fetch jobs fail fast and leave no trace behind them.
This may have been the "ran out of metadata space on GCP" issue again.