
K8S Runner: Race condition when modifying job

Open innovate-invent opened this issue 4 years ago • 3 comments

Galaxy 21.01

Traceback (most recent call last):
  File "/srv/galaxy/lib/galaxy/jobs/runners/kubernetes.py", line 526, in _handle_job_failure
    self.__cleanup_k8s_job(job)
  File "/srv/galaxy/lib/galaxy/jobs/runners/kubernetes.py", line 533, in __cleanup_k8s_job
    stop_job(job, k8s_cleanup_job)
  File "/srv/galaxy/lib/galaxy/jobs/runners/util/pykube_util.py", line 75, in stop_job
    job.scale(replicas=0)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/pykube/mixins.py", line 32, in scale
    self.update()
  File "/srv/galaxy/venv/lib/python3.8/site-packages/pykube/objects.py", line 119, in update
    self.api.raise_for_status(r)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/pykube/http.py", line 106, in raise_for_status
    raise HTTPError(resp.status_code, payload["message"])
pykube.exceptions.HTTPError: Operation cannot be fulfilled on jobs.batch "gxy-islandcompare-test-tlpmh": the object has been modified; please apply your changes to the latest version and try again

The runner needs to catch this, refresh, and retry.
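
A rough sketch of what "catch, refresh, and retry" could look like, not the actual Galaxy fix. The message in the traceback is what the Kubernetes API returns for a 409 Conflict. To keep the example runnable without a cluster, `HTTPError` here is a stand-in for `pykube.exceptions.HTTPError`, and `FakeJob` is a hypothetical stub that only mimics `scale()` and `reload()`. The helper name `scale_with_retry` and its parameters are made up for illustration.

```python
import time


class HTTPError(Exception):
    """Stand-in for pykube.exceptions.HTTPError (status code plus message)."""

    def __init__(self, code, message):
        super().__init__(message)
        self.code = code


def scale_with_retry(job, replicas=0, max_attempts=5, delay=0.0):
    """Scale a job, reloading and retrying on a 409 Conflict.

    Returns the number of attempts used. Hypothetical helper, not Galaxy code.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            job.scale(replicas=replicas)
            return attempt
        except HTTPError as exc:
            # Anything other than a conflict, or a conflict on the last
            # attempt, is re-raised to the caller unchanged.
            if getattr(exc, "code", None) != 409 or attempt == max_attempts:
                raise
            # Refresh the local copy so the next update carries the
            # latest resourceVersion, then try again.
            job.reload()
            time.sleep(delay)


class FakeJob:
    """Hypothetical stub: raises a conflict a fixed number of times."""

    def __init__(self, conflicts):
        self.conflicts = conflicts
        self.reloads = 0
        self.replicas = None

    def scale(self, replicas):
        if self.conflicts > 0:
            self.conflicts -= 1
            raise HTTPError(409, "the object has been modified")
        self.replicas = replicas

    def reload(self):
        self.reloads += 1
```

In the runner, something like this would wrap the `job.scale(replicas=0)` call in `stop_job`. The attempt limit is there so that a job which keeps conflicting still surfaces an error instead of looping forever.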

innovate-invent avatar Mar 25 '21 18:03 innovate-invent

That's fixed with https://github.com/galaxyproject/galaxy/pull/11715, right? Thanks for the report and fix!

mvdbeek avatar Jun 15 '21 10:06 mvdbeek

Actually, this wouldn't be covered by #11715. I totally forgot about this.

innovate-invent avatar Jun 15 '21 18:06 innovate-invent

This issue is still causing regular (but random) job failures for us. I'd say roughly one job in 50 is affected. We thought this would have been addressed in https://github.com/galaxyproject/galaxy/pull/15238 for Kubernetes >=1.26, but the issue persists with Galaxy v24.1 (only now the error message is "An unknown error occurered with this job" [sic]).

pascalg avatar Jul 17 '24 07:07 pascalg