MLServer
Server appears healthy even when all workers are down
I ran into an issue: when all parallel workers crash for foreseeable reasons, MLServer continues to run as if it were healthy, but it is no longer able to process any (REST API) requests.
After looking a bit into the code base, it seems that `_get_worker()` in `dispatcher.py` does not check whether the worker is still alive before sending the request to it.
It would be helpful if mlserver could catch and report crashed workers.
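To illustrate the kind of check I mean, here is a minimal sketch of a round-robin picker that skips dead workers. It is not MLServer's actual code: `LivenessAwareRoundRobin`, `get_worker()` and `NoLiveWorkersError` are hypothetical names, and the only assumption about a worker is that it exposes `is_alive()`, as `multiprocessing.Process` does.

```python
from typing import Iterable, List, Protocol


class WorkerLike(Protocol):
    """Anything with is_alive(), e.g. a multiprocessing.Process."""

    def is_alive(self) -> bool: ...


class NoLiveWorkersError(RuntimeError):
    """Raised when every worker in the pool has exited."""


class LivenessAwareRoundRobin:
    """Hypothetical round-robin picker that never returns a dead worker."""

    def __init__(self, workers: Iterable[WorkerLike]):
        self._workers: List[WorkerLike] = list(workers)
        self._next = 0  # index where the next search starts

    def get_worker(self) -> WorkerLike:
        total = len(self._workers)
        # Visit each worker at most once, starting from the rotation point.
        for offset in range(total):
            idx = (self._next + offset) % total
            worker = self._workers[idx]
            if worker.is_alive():
                self._next = (idx + 1) % total
                return worker
        # Fail loudly so the server can report itself as unhealthy.
        raise NoLiveWorkersError("all workers are down")
```

With something like this, a request would fail fast with a clear error (which a health endpoint could also surface) instead of being queued for a worker that will never answer.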
Great spot and diagnosis, @nandev! Thanks for raising this one.
Ideally, we should restart (and reload) dead workers. That was part of the original spec, but was left for future iterations. Hopefully we can get some progress on this one in time for the next release.
Following up on this one: after a deeper look, these are some of the nuances that we need to take into account (writing them here just to ensure they don't fall through the cracks):
- For detecting when a worker has died, we can set up a `SIGCHLD` signal handler. That won't give us the dead worker PID, but we can trigger a general check across all worker pools to ensure all workers are up and running (and / or restart them if any of them is dead).
- When reloading the model, we will need to ensure that the `Dispatcher` forwards `model_update` requests (to ensure new models make their way through during loading), while not sending any inference requests (since we won't be sure whether all models have been loaded already). For this, we can introduce a `ready` flag within the workers, which will only be set to `True` once all models have been reloaded (i.e. once the worker catches up with the other workers in the pool).
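
A rough sketch of how these two points could fit together, just to make the idea concrete. None of this is MLServer's real API: `WorkerPool`, `WorkerSlot`, `check_workers()`, `mark_ready()` and `install_sigchld_check()` are hypothetical names, and the `spawn` callable stands in for whatever actually starts a worker process. The `SIGCHLD` part is Unix-only.

```python
import signal
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class WorkerSlot:
    process: Any         # anything exposing is_alive()
    ready: bool = False  # True only once all models have been (re)loaded


class WorkerPool:
    """Hypothetical pool that restarts dead workers and tracks readiness."""

    def __init__(self, spawn: Callable[[], Any], size: int):
        self._spawn = spawn
        self.slots: List[WorkerSlot] = [WorkerSlot(spawn()) for _ in range(size)]

    def check_workers(self) -> int:
        """General check: replace every dead worker, return how many."""
        restarted = 0
        for slot in self.slots:
            if not slot.process.is_alive():
                slot.process = self._spawn()
                slot.ready = False  # must reload models before serving again
                restarted += 1
        return restarted

    def mark_ready(self, slot: WorkerSlot) -> None:
        """Call once the worker has caught up with the rest of the pool."""
        slot.ready = True

    def model_update_targets(self) -> List[WorkerSlot]:
        # Model updates go to every live worker, ready or not.
        return [s for s in self.slots if s.process.is_alive()]

    def inference_targets(self) -> List[WorkerSlot]:
        # Inference requests only go to workers that finished reloading.
        return [s for s in self.slots if s.ready and s.process.is_alive()]


def install_sigchld_check(pool: WorkerPool) -> None:
    """Run the general check whenever any child process exits (Unix only)."""

    def _on_sigchld(signum, frame):
        # SIGCHLD does not say which child died, so check all of them.
        pool.check_workers()

    signal.signal(signal.SIGCHLD, _on_sigchld)
```

In a real setup, `spawn` would create and start a `multiprocessing.Process`, and `mark_ready()` would be called only after the restarted worker confirms that every model has been reloaded. Doing the restart directly inside the signal handler keeps the sketch short; handing it off to the event loop or a background task would probably be safer.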