Pulsar async update failure handling bug

Open galaxyproject-sentryintegration[bot] opened this issue 1 year ago • 0 comments

This is not a common occurrence, but the bug below happened because the job working directory could not be found. This was probably a temporary issue: the state that was meant to be set was running, and I see the following job state history, in which the job later reaches ok:

galaxy_main=> select * from job_state_history where job_id=58043483;
    id     |        create_time         |  job_id  | state  | info
-----------+----------------------------+----------+--------+------
 213931908 | 2024-05-20 16:42:33.49268  | 58043483 | new    |
 213989344 | 2024-05-21 10:19:16.167964 | 58043483 | queued |
 213989623 | 2024-05-21 10:24:35.962527 | 58043483 | ok     |

Looking at this, however, we should fail the job when a status update cannot be applied and the state it was meant to set is one of the error states (and think about what we should do if that state is OK).
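A minimal sketch of the handling proposed above, to make the three cases concrete. None of this is Galaxy or Pulsar code: `handle_status_update`, `Action`, `ERROR_STATES`, `OK_STATES`, and the local `ObjectNotFound` stand-in are all hypothetical names, and the set of states treated as errors is an assumption.

```python
from enum import Enum


class ObjectNotFound(Exception):
    """Local stand-in for the exception seen in the traceback (not the Galaxy class)."""


class Action(Enum):
    APPLIED = "applied"                # update went through
    IGNORED = "ignored"                # transient failure, safe to skip
    FAIL_JOB = "fail_job"              # update failed and the target was an error state
    NEEDS_DECISION = "needs_decision"  # update failed and the target was OK: still open


# Assumption: which incoming states count as error or OK states is a guess.
ERROR_STATES = frozenset({"failed", "lost", "error"})
OK_STATES = frozenset({"ok", "complete"})


def handle_status_update(apply_update, target_state):
    """Try to apply a status update and decide what to do if it cannot be applied."""
    try:
        apply_update(target_state)
        return Action.APPLIED
    except ObjectNotFound:
        if target_state in ERROR_STATES:
            # The job was already failing; do not leave it in a stale state.
            return Action.FAIL_JOB
        if target_state in OK_STATES:
            # The issue leaves this case undecided, so the sketch only flags it.
            return Action.NEEDS_DECISION
        # e.g. "running": a later update can still move the job forward.
        return Action.IGNORED


def missing_working_dir(state):
    """Simulates the failure in this report: the working directory is gone."""
    raise ObjectNotFound("No such object found.")


print(handle_status_update(missing_working_dir, "running"))
print(handle_status_update(missing_working_dir, "failed"))
```

With the job in this report, the failed update targeted running, so the sketch would take the ignore branch, which matches what happened in practice. The error-state branch is the one that is currently missing.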

Sentry Issue: GALAXY-MAIN-WYK

ObjectNotFound: No such object found.
(5 additional frame(s) were not displayed)
...
  File "galaxy/objectstore/__init__.py", line 448, in get_filename
    return self._invoke("get_filename", obj, **kwargs)
  File "galaxy/objectstore/__init__.py", line 424, in _invoke
    return self.__getattribute__(f"_{delegate}")(obj=obj, **kwargs)
  File "galaxy/objectstore/__init__.py", line 975, in _get_filename
    return self._call_method("_get_filename", obj, ObjectNotFound, True, **kwargs)
  File "galaxy/objectstore/__init__.py", line 1209, in _call_method
    return self.backends[object_store_id].__getattribute__(method)(obj, **kwargs)
  File "galaxy/objectstore/__init__.py", line 878, in _get_filename
    raise ObjectNotFound

Failed to update Pulsar job status for job_id (58043483/58043483)

May 21 '24 12:05 galaxyproject-sentryintegration[bot]