
If a job fails very quickly, we never get any logs

Open ihodes opened this issue 9 years ago • 5 comments

ihodes • Dec 08 '16 20:12

We already try to save the logs when a job dies: https://github.com/hammerlab/coclobas/blob/master/src/lib/server.ml#L227

What happened in your case? Can you still `describe` it, for example?

smondet • Dec 08 '16 20:12
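As the JSON record further down shows, the saved entry is essentially the result of running `kubectl logs <pod-id>`. Below is a minimal, self-contained OCaml sketch of that idea only, not the actual server.ml code; the function and file names are invented for illustration, and JSON escaping is approximated with OCaml's `%S`:

```ocaml
(* Hypothetical sketch, not the actual server.ml code: when a job is
   seen dead, run `kubectl logs <pod-id>` and keep whatever comes back,
   including a non-zero exit status, in a small JSON record.
   Build with: ocamlfind ocamlopt -package unix -linkpkg save_logs.ml *)

let read_all ic =
  let buf = Buffer.create 1024 in
  (try
     while true do
       Buffer.add_string buf (input_line ic);
       Buffer.add_char buf '\n'
     done
   with End_of_file -> ());
  Buffer.contents buf

let save_logs_on_death ~pod_id ~log_file =
  let cmd = Printf.sprintf "kubectl logs %s" (Filename.quote pod_id) in
  let ic = Unix.open_process_in cmd in
  let stdout = read_all ic in
  let code =
    match Unix.close_process_in ic with
    | Unix.WEXITED n -> n
    | Unix.WSIGNALED n | Unix.WSTOPPED n -> 128 + n
  in
  (* Record the attempt even when `kubectl logs` itself fails, so errors
     like the SSH-tunnel one shown below are at least kept somewhere.
     stderr is not captured in this simplified version, and OCaml's %S
     escaping only approximates JSON escaping. *)
  let oc = open_out log_file in
  Printf.fprintf oc
    "{\"command\": %S, \"stdout\": %S, \"status\": [\"Exited\", %d]}\n"
    cmd stdout code;
  close_out oc

let () =
  save_logs_on_death
    ~pod_id:"522c556b-b975-567e-b254-02d4beadc9ca"
    ~log_file:"/tmp/example_kubectl_logs.json"
```

The real record, as the JSON below shows, also keeps stderr and any exception alongside stdout and the exit status.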

Describe works; the job runs in the Docker container, but it dies immediately (bad CLI args in my shell script) and exits.

opam@e2b78e43fa00:/coclo/_cocloroot/logs/logs/job/522c556b-b975-567e-b254-02d4beadc9ca/commands$ cat 1481229903345_3c1b3504.json
{
  "command": {
    "command": "kubectl logs 522c556b-b975-567e-b254-02d4beadc9ca",
    "stdout": "",
    "stderr":
      "Error from server: Get https://gke-ihodes-coco3-cluster-default-pool-36378887-pskd:10250/containerLogs/default/522c556b-b975-567e-b254-02d4beadc9ca/522c556b-b975-567e-b254-02d4beadc9cacontainer: No SSH tunnels currently open. Were the targets able to accept an ssh-key for user \"gke-e170239faa5e49b2ac95\"?\n",
    "status": [ "Exited", 1 ],
    "exn": null
  }
}

ihodes • Dec 08 '16 20:12
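Since describe still answers when `kubectl logs` hits the SSH-tunnel error above, one conceivable mitigation (an untested sketch, not something this thread says coclobas does) is to fall back to `kubectl describe pod <id>` whenever the log fetch exits non-zero; the pod's event list usually stays reachable even when container logs are not:

```ocaml
(* Untested sketch of a possible fallback, not behaviour the thread
   attributes to coclobas: when the log fetch exits non-zero, also grab
   `kubectl describe pod <id>`, whose event list usually stays reachable
   even when container logs are not. *)
let fetch_diagnostics pod_id =
  let run cmd =
    let ic = Unix.open_process_in cmd in
    let buf = Buffer.create 1024 in
    (try
       while true do
         Buffer.add_string buf (input_line ic);
         Buffer.add_char buf '\n'
       done
     with End_of_file -> ());
    (Buffer.contents buf, Unix.close_process_in ic)
  in
  let quoted = Filename.quote pod_id in
  match run (Printf.sprintf "kubectl logs %s" quoted) with
  | logs, Unix.WEXITED 0 -> logs
  | _failed_logs, _ ->
    fst (run (Printf.sprintf "kubectl describe pod %s" quoted))

let () = print_string (fetch_diagnostics Sys.argv.(1))
```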

This may be due to the Google Cloud metadata limitation; we run out of room at 32 KB or something absurd (project-wide).

ihodes • Dec 08 '16 20:12
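A rough way to see how much project-wide metadata is actually in use (the 32 KB figure above is the comment's guess, not verified here) is to sum the value sizes reported by `gcloud compute project-info describe`. The sketch below assumes `gcloud` is on the PATH and the Yojson library is installed (build with `ocamlfind ocamlopt -package unix,yojson -linkpkg`):

```ocaml
(* Rough, unverified check of how much project-wide metadata is in use
   (the limit itself is only guessed at above). Assumes `gcloud` is on
   the PATH and the Yojson library is available. *)
let () =
  let ic =
    Unix.open_process_in
      "gcloud compute project-info describe --format=json"
  in
  let buf = Buffer.create 4096 in
  (try
     while true do
       Buffer.add_string buf (input_line ic);
       Buffer.add_char buf '\n'
     done
   with End_of_file -> ());
  ignore (Unix.close_process_in ic);
  let json = Yojson.Safe.from_string (Buffer.contents buf) in
  let open Yojson.Safe.Util in
  let items =
    match json |> member "commonInstanceMetadata" |> member "items" with
    | `Null -> []
    | j -> to_list j
  in
  let bytes =
    List.fold_left
      (fun acc item ->
         acc + String.length (item |> member "value" |> to_string))
      0 items
  in
  Printf.printf "commonInstanceMetadata: %d keys, %d bytes of values\n"
    (List.length items) bytes
```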

I also had similar issues where the describe log showed a successful allocation of resources and the initiation of the job, yet the job failed without any Kubernetes log. For example, when you pass invalid URLs to wget (that is, a poorly constructed URL for --tumor, --rna, or --normal), those fetch jobs also fail fast and leave no trace behind them.

armish • Dec 08 '16 20:12

This may have been the "ran out of metadata space on GCP" issue again.

ihodes • Dec 08 '16 22:12