
Server appears healthy even when all workers are down

Open nandev opened this issue 3 years ago • 1 comments

I ran into an issue: when all parallel workers crash (for foreseeable reasons), mlserver continues to run as if it were healthy, but it is no longer able to process any (REST API) requests. After looking into the code base a bit, it seems that _get_worker() in dispatcher.py does not check whether the worker is still alive before sending the request to it. It would be helpful if mlserver could catch and report crashed workers.
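To illustrate the kind of check meant here, below is a minimal sketch of a round-robin picker that skips dead workers. It is not MLServer's actual dispatcher code: `RoundRobinPicker`, `get_worker` and `NoLiveWorkersError` are hypothetical names, and the only assumption about a worker is that it exposes `is_alive()`, as `multiprocessing.Process` does.

```python
class NoLiveWorkersError(RuntimeError):
    """Raised when every worker in the pool has died (hypothetical name)."""


class RoundRobinPicker:
    """Hypothetical stand-in for the dispatcher's worker selection."""

    def __init__(self, workers):
        # Each worker only needs an is_alive() method,
        # e.g. a multiprocessing.Process instance.
        self._workers = list(workers)
        self._cursor = 0

    def get_worker(self):
        # Try each worker at most once, starting from the cursor.
        for _ in range(len(self._workers)):
            worker = self._workers[self._cursor]
            self._cursor = (self._cursor + 1) % len(self._workers)
            if worker.is_alive():
                return worker
        # Nothing is alive: fail loudly instead of queueing a request
        # that will never be answered.
        raise NoLiveWorkersError("all workers are down")
```

With something like this, the server could turn `NoLiveWorkersError` into an error response, or into a failing health check, rather than hanging.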

nandev avatar Sep 01 '22 10:09 nandev

Great spot and diagnosis, @nandev! Thanks for raising this one.

Ideally, we should restart (and reload) dead workers. That was part of the original spec, but was left for future iterations. Hopefully we can get some progress on this one in time for the next release.

adriangonz avatar Sep 01 '22 16:09 adriangonz

Following up on this after a deeper look, these are some of the nuances that we need to take into account (writing them here just to ensure they don't fall through the cracks):

  • For detecting when a worker has died, we can set up a SIGCHLD signal handler. That won't give us the dead worker PID, but we can trigger a general check across all worker pools to ensure all workers are up and running (and / or restart them if any of them is dead).
  • When reloading the model, we will need to ensure that the Dispatcher forwards model_update requests (to ensure new models make their way through during loading), while not sending any inference requests (since we won't be sure whether all models have been loaded already). For this, we can introduce a ready flag within the workers, which will only be set to True once all models have been reloaded (i.e. once it catches up with the other workers in the pool).
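
The two points above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the planned implementation: `WorkerSlot`, `sweep`, `targets_for` and `install_sigchld_handler` are hypothetical names, the `restart` callback is assumed to start a replacement process and re-point the slot at it, and `signal.SIGCHLD` is only available on Unix.

```python
import signal
from dataclasses import dataclass
from typing import Callable


@dataclass
class WorkerSlot:
    # Hypothetical bookkeeping record, not MLServer's actual worker class.
    name: str
    is_alive: Callable[[], bool]
    ready: bool = False  # True only once all models have been reloaded


def sweep(slots, restart):
    """General check across all workers: restart dead ones, mark them not ready."""
    restarted = []
    for slot in slots:
        if not slot.is_alive():
            slot.ready = False
            restart(slot)  # assumed to spawn a replacement process
            restarted.append(slot.name)
    return restarted


def targets_for(slots, kind):
    """model_update requests go to every worker; inference only to ready ones."""
    if kind == "model_update":
        return list(slots)
    return [slot for slot in slots if slot.ready]


def install_sigchld_handler(on_child_exit):
    """SIGCHLD does not say which child died, so just trigger a full sweep."""
    # Must be called from the main thread (a signal.signal requirement).
    signal.signal(signal.SIGCHLD, lambda signum, frame: on_child_exit())
```

A restarted worker keeps receiving `model_update` requests while it catches up, and something outside this sketch would set `ready = True` once it has reloaded every model. In an asyncio-based server, `loop.add_signal_handler` may be a better fit than `signal.signal`.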

adriangonz avatar Apr 03 '23 14:04 adriangonz