
Health check failures do not clean up properly

Open markcoatsworth opened this issue 1 year ago • 0 comments

Describe the bug

When a model job fails a health check, it moves immediately into FailedState. FailedState rejects shutdown() with an InvalidStateError, so the gateway never sends the request to shut the job down. We should send the shutdown request first and only then move to FailedState. Celery logs follow:

[2023-05-01 20:59:58,426: WARNING/ForkPoolWorker-3] The model is healthy
[2023-05-03 18:42:23,284: WARNING/ForkPoolWorker-3] Model health verification error:: HTTPConnectionPool(host='172.17.8.109', port=43537): Max retries exceeded with url: /health (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fb5c7984b80>: Failed to establish a new connection: [Errno 111] Connection refused'))
[2023-05-03 18:42:23,285: WARNING/ForkPoolWorker-3] [2023-05-03 18:42:23,285] ERROR in models: Health check for active model OPT-6.7B failed
[2023-05-03 18:42:23,285: ERROR/ForkPoolWorker-3] Health check for active model OPT-6.7B failed
[2023-05-03 18:42:23,429: ERROR/ForkPoolWorker-3] Task tasks.verify_model_instance_health[94ca39cf-8249-46de-ae48-605aae58ec48] raised unexpected: InvalidStateError("Invalid operation for model instance state: <class 'models.FailedState'>")
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/celery/app/trace.py", line 451, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/app/gateway_service.py", line 55, in __call__
    return self.run(*args, **kwargs)
  File "/app/tasks.py", line 11, in verify_model_instance_health
    model_instance.shutdown()
  File "/app/models.py", line 302, in shutdown
    self._state.shutdown()
  File "/app/models.py", line 200, in shutdown
    raise InvalidStateError(self)
errors.InvalidStateError: Invalid operation for model instance state: <class 'models.FailedState'>
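
For illustration, the ordering proposed above could look roughly like the sketch below. This is not the gateway's actual code: only the names InvalidStateError, FailedState and shutdown() come from the traceback, while ActiveState, ModelInstance.verify_health, the is_healthy callable and the shutdown_requests counter are hypothetical stand-ins. The idea is simply to send the shutdown request while the state still allows it, and only then move to FailedState.

```python
class InvalidStateError(Exception):
    """Raised when an operation is not valid in the current state."""


class ActiveState:
    def shutdown(self, instance):
        # Hypothetical stand-in for sending the real shutdown request.
        instance.shutdown_requests += 1


class FailedState:
    def shutdown(self, instance):
        # Mirrors the traceback: FailedState rejects shutdown().
        raise InvalidStateError(
            f"Invalid operation for model instance state: {type(self)}"
        )


class ModelInstance:
    def __init__(self, is_healthy):
        # is_healthy: assumed callable returning True when /health responds.
        self._is_healthy = is_healthy
        self._state = ActiveState()
        self.shutdown_requests = 0  # counter used only for this sketch

    def shutdown(self):
        self._state.shutdown(self)

    def verify_health(self):
        if self._is_healthy():
            return True
        try:
            # Send the shutdown request while the state still permits it.
            self.shutdown()
        finally:
            # Record the failure only after the shutdown attempt.
            self._state = FailedState()
        return False
```

With this ordering a failed health check produces exactly one shutdown request before the instance ends up in FailedState, and the try/finally keeps the state transition even if the shutdown request itself raises.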

markcoatsworth, May 04 '23 14:05