Pulsar async update failure handling bug
This is not a common occurrence, but the bug below happened because the job working directory could not be found. This was probably a transient issue: the state that was meant to be set was running, and the job state history shows the job going from queued straight to ok:
galaxy_main=> select * from job_state_history where job_id=58043483;
    id     |        create_time         |  job_id  | state  | info
-----------+----------------------------+----------+--------+------
 213931908 | 2024-05-20 16:42:33.49268  | 58043483 | new    |
 213989344 | 2024-05-21 10:19:16.167964 | 58043483 | queued |
 213989623 | 2024-05-21 10:24:35.962527 | 58043483 | ok     |
Looking at this, however, we should fail the job when the state being updated to is one of the error states (and think about what we should do when the state is ok).
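A rough sketch of the proposed behaviour, to make the discussion concrete. Everything here is hypothetical and is not Galaxy's or Pulsar's actual code: `Job`, `handle_status_update`, `apply_update`, the return values, and the contents of `ERROR_STATES` are all assumptions for illustration, and `ObjectNotFound` is a local stand-in for the exception in the traceback. The ok case is deliberately left undecided.

```python
from dataclasses import dataclass, field


class ObjectNotFound(Exception):
    """Local stand-in for the exception raised when the working directory is missing."""


# Assumed set of error-like states; the real list may differ.
ERROR_STATES = {"error", "failed", "lost"}


@dataclass
class Job:
    """Hypothetical minimal job record, only tracking state."""
    id: int
    state: str = "queued"
    history: list = field(default_factory=list)

    def set_state(self, state):
        self.state = state
        self.history.append(state)


def handle_status_update(job, new_state, apply_update):
    """Try to apply an async status update and decide what to do if it fails.

    apply_update is a hypothetical callable that performs the real update and
    may raise ObjectNotFound when the job working directory cannot be found.
    """
    try:
        apply_update(job, new_state)
        return "updated"
    except ObjectNotFound:
        if new_state in ERROR_STATES:
            # The remote side reported a failure, so fail the job here
            # rather than only logging the failed update.
            job.set_state("error")
            return "failed"
        # Non-error states such as running: assume the problem is transient
        # and leave the job untouched. What to do for ok is still open.
        return "skipped"
```

With this shape, a failed update to running leaves the job as it was, while a failed update to an error state marks the job as errored.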
Sentry Issue: GALAXY-MAIN-WYK
ObjectNotFound: No such object found.
(5 additional frame(s) were not displayed)
...
  File "galaxy/objectstore/__init__.py", line 448, in get_filename
    return self._invoke("get_filename", obj, **kwargs)
  File "galaxy/objectstore/__init__.py", line 424, in _invoke
    return self.__getattribute__(f"_{delegate}")(obj=obj, **kwargs)
  File "galaxy/objectstore/__init__.py", line 975, in _get_filename
    return self._call_method("_get_filename", obj, ObjectNotFound, True, **kwargs)
  File "galaxy/objectstore/__init__.py", line 1209, in _call_method
    return self.backends[object_store_id].__getattribute__(method)(obj, **kwargs)
  File "galaxy/objectstore/__init__.py", line 878, in _get_filename
    raise ObjectNotFound
Failed to update Pulsar job status for job_id (58043483/58043483)