Out of memory is not well handled

Open a1exwang opened this issue 4 months ago • 0 comments

Foundry local version 0.6.87+e69a6c3d2b

Repro steps:

Load model A
Call /v1/chat/completions API with model A
Load model B
Call /v1/chat/completions API with model B

The server returns http 500 error without an error message or error code. So the client is unable to know what happened. By checking the log, I can see

E:\_work\1\s\onnxruntime\core\providers\cuda\cuda_call.cc:129 onnxruntime::CudaCall E:\_work\1\s\onnxruntime\core\providers\cuda\cuda_call.cc:121 onnxruntime::CudaCall CUDA failure 2: out of memory ; GPU=0 ; hostname=ALEX-P14S ; file=E:\_work\1\s\onnxruntime\core\providers\cuda\cuda_execution_provider.cc ; line=287 ; expr=cudaDeviceSynchronize();

Expected:

The 500 error body should contain error code or message indicating out of memory error.
The model could be automatically unloaded if not used for sometime like Ollama.

AB#74041

Aug 06 '25 06:08 a1exwang