
Adding clearml_serving_inference restart on CUDA OOM

Open IlyaMescheryakov1402 opened this issue 2 months ago • 0 comments

Restarting clearml-serving-inference on `torch.cuda.OutOfMemoryError: CUDA out of memory.` helps the inference container clear GPU memory. This is useful for LLM inference in the clearml-serving-inference container (requires https://github.com/allegroai/clearml-helm-charts/blob/main/charts/clearml-serving/templates/clearml-serving-inference-deployment.yaml#L74 to be set to 1).
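The restart-on-OOM idea can be sketched as follows. All names here are illustrative, not the PR's actual code, and the torch exception is matched by class name/message so the sketch has no hard torch dependency: on a CUDA OOM, the current gunicorn worker terminates itself so the master respawns it with a fresh CUDA context (and thus freed GPU memory).

```python
import os
import signal


def is_cuda_oom(exc: BaseException) -> bool:
    # torch.cuda.OutOfMemoryError is a RuntimeError subclass; matching by
    # class name / message text avoids importing torch in this sketch.
    return (type(exc).__name__ == "OutOfMemoryError"
            or "CUDA out of memory" in str(exc))


def restart_worker_on_oom(exc: BaseException) -> bool:
    """Hypothetical handler: if exc is a CUDA OOM, kill this worker process
    so gunicorn's master spawns a replacement with clean GPU memory."""
    if not is_cuda_oom(exc):
        return False
    os.kill(os.getpid(), signal.SIGTERM)
    return True
```

A real handler would be wired into the FastAPI/gunicorn exception path in main.py; the key point is that only a process restart reliably releases the worker's CUDA context.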

List of changes:

  1. `__init__.py` - add a new variable, CLEARML_INFERENCE_TASK_ID, that allows clearml_serving_inference to connect to an existing task instead of creating a new one
  2. entrypoint.sh - add CLEARML_INFERENCE_TASK_ID logging
  3. main.py - add gunicorn worker restart on CUDA OOM exception
  4. model_request_processor.py - add periodic attempts to clean GPU memory
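Item 4 above could look roughly like this (helper names are assumptions, and the torch import is guarded so the sketch also runs on machines without a GPU): a daemon thread that periodically collects garbage and returns cached CUDA blocks to the driver.

```python
import gc
import threading
import time


def clean_gpu_memory() -> None:
    # Drop unreachable Python objects first, then ask torch to release
    # cached CUDA allocator blocks back to the driver. torch is optional here.
    gc.collect()
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except ImportError:
        pass


def start_periodic_cleanup(interval_sec: float = 60.0) -> threading.Thread:
    """Run clean_gpu_memory() every interval_sec seconds in a daemon thread."""
    def _loop() -> None:
        while True:
            time.sleep(interval_sec)
            clean_gpu_memory()

    thread = threading.Thread(target=_loop, daemon=True)
    thread.start()
    return thread
```

Note that `torch.cuda.empty_cache()` only frees *cached* blocks; memory still referenced by live tensors is untouched, which is why the worker restart in item 3 remains necessary for hard OOM cases.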

IlyaMescheryakov1402 · Apr 15 '24 22:04