K8S Runner: Race condition when modifying job
Galaxy 21.01
Traceback (most recent call last):
  File "/srv/galaxy/lib/galaxy/jobs/runners/kubernetes.py", line 526, in _handle_job_failure
    self.__cleanup_k8s_job(job)
  File "/srv/galaxy/lib/galaxy/jobs/runners/kubernetes.py", line 533, in __cleanup_k8s_job
    stop_job(job, k8s_cleanup_job)
  File "/srv/galaxy/lib/galaxy/jobs/runners/util/pykube_util.py", line 75, in stop_job
    job.scale(replicas=0)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/pykube/mixins.py", line 32, in scale
    self.update()
  File "/srv/galaxy/venv/lib/python3.8/site-packages/pykube/objects.py", line 119, in update
    self.api.raise_for_status(r)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/pykube/http.py", line 106, in raise_for_status
    raise HTTPError(resp.status_code, payload["message"])
pykube.exceptions.HTTPError: Operation cannot be fulfilled on jobs.batch "gxy-islandcompare-test-tlpmh": the object has been modified; please apply your changes to the latest version and try again
The runner needs to catch this, refresh, and retry.
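A minimal sketch of that catch, refresh, and retry loop, not the actual Galaxy fix. `scale_with_retry` and its `attempts` and `delay` parameters are hypothetical names. It assumes the pykube object exposes `scale()` and `reload()`, that the raised `pykube.exceptions.HTTPError` carries the HTTP status in a `code` attribute, and that the "object has been modified" error arrives as HTTP 409 (Conflict).

```python
import time

HTTP_CONFLICT = 409  # status Kubernetes returns on a resourceVersion conflict


def scale_with_retry(job, replicas=0, attempts=3, delay=0.5):
    """Scale a job, reloading and retrying when the update conflicts.

    Hypothetical helper: `job` is assumed to behave like a pykube Job
    (scale() and reload()), and the error is assumed to expose the HTTP
    status as `.code`, as pykube.exceptions.HTTPError is assumed to do.
    Returns the number of attempts that were needed.
    """
    for attempt in range(1, attempts + 1):
        try:
            job.scale(replicas=replicas)
            return attempt
        except Exception as exc:
            is_conflict = getattr(exc, "code", None) == HTTP_CONFLICT
            # Re-raise anything that is not a conflict, or the last failure.
            if not is_conflict or attempt == attempts:
                raise
            time.sleep(delay)
            # Fetch the latest version of the object before trying again.
            job.reload()
```

In `stop_job`, the call `job.scale(replicas=0)` could then become `scale_with_retry(job, replicas=0)`. Matching on the status code rather than the message text keeps the check independent of the exact wording the API server uses.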
That's fixed with https://github.com/galaxyproject/galaxy/pull/11715, right? Thanks for the report and fix!
Actually, this wouldn't be covered by #11715. I totally forgot about this.
This issue is still causing regular (but random) job failures for us; I'd say roughly every 50th job is affected. We thought it had been addressed in https://github.com/galaxyproject/galaxy/pull/15238 for Kubernetes >=1.26, but the issue persists with Galaxy v24.1 (only now the error message is "An unknown error occurered with this job" [sic]).