Adding clearml_serving_inference restart on CUDA OOM
Restarting clearml-serving-inference on `torch.cuda.OutOfMemoryError: CUDA out of memory` helps the inference container clear GPU memory. This is useful for LLM inference in the clearml-serving-inference container (requires https://github.com/allegroai/clearml-helm-charts/blob/main/charts/clearml-serving/templates/clearml-serving-inference-deployment.yaml#L74 set to 1). A minimal sketch of the mechanism is shown below.
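A hedged sketch of the restart idea (the handler and route names are illustrative, not the PR's exact code): the request handler catches `torch.cuda.OutOfMemoryError` and terminates the current gunicorn worker, so the master respawns it with a fresh CUDA context; with a single replica, Kubernetes restarts the whole container.

```python
import os
import signal

import torch
from fastapi import FastAPI

app = FastAPI()


async def run_inference(payload: dict) -> dict:
    # Placeholder for the real model call; assumed to raise
    # torch.cuda.OutOfMemoryError when the GPU is exhausted.
    raise NotImplementedError


@app.post("/infer")
async def infer(payload: dict):
    try:
        return await run_inference(payload)
    except torch.cuda.OutOfMemoryError:
        # The CUDA context cannot be reliably recovered in-process;
        # kill this gunicorn worker so the master spawns a fresh one,
        # which releases all GPU memory held by the dead process.
        os.kill(os.getpid(), signal.SIGABRT)
        raise
```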
List of changes:

- `__init__.py`: add a new variable `CLEARML_INFERENCE_TASK_ID` that allows clearml_serving_inference to connect to an existing task instead of creating a new one (see the first sketch after this list)
- `entrypoint.sh`: add `CLEARML_INFERENCE_TASK_ID` logging
- `main.py`: add a gunicorn worker restart on CUDA OOM exceptions (sketched above)
- `model_request_processor.py`: add periodic attempts to free GPU memory (see the second sketch after this list)
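One plausible way the new environment variable could be consumed (the variable name comes from this PR; the helper itself is illustrative, not the PR's exact code): if `CLEARML_INFERENCE_TASK_ID` is set, the serving process re-attaches to that task, so a restarted container keeps reporting to the same task ID.

```python
import os

from clearml import Task


def get_serving_task(project: str, name: str) -> Task:
    """Reuse an existing inference task when CLEARML_INFERENCE_TASK_ID is set,
    otherwise create a new one (illustrative helper)."""
    task_id = os.environ.get("CLEARML_INFERENCE_TASK_ID")
    if task_id:
        # Re-attach to the existing task (e.g. after a container restart)
        # so metrics and logs continue under the same task ID.
        return Task.get_task(task_id=task_id)
    return Task.init(project_name=project, task_name=name)
```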
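For the periodic GPU memory cleanup, a plausible shape (assumed, not the PR's exact implementation) is a background loop that triggers Python garbage collection and returns cached, unused CUDA blocks to the driver. This does not free live tensors, but it reduces fragmentation between requests and makes OOM less likely.

```python
import asyncio
import gc

import torch


async def gpu_cleanup_loop(interval_sec: float = 60.0) -> None:
    # Periodically collect unreachable Python objects (dropping their
    # tensor references) and release PyTorch's cached CUDA blocks.
    while True:
        await asyncio.sleep(interval_sec)
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
```

Such a loop would be started once at application startup, e.g. `asyncio.create_task(gpu_cleanup_loop())` in a FastAPI startup hook.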