Foundry-Local
Foundry-Local copied to clipboard
Out of memory is not well handled
Foundry local version 0.6.87+e69a6c3d2b
Repro steps:
- Load model A
- Call /v1/chat/completions API with model A
- Load model B
- Call /v1/chat/completions API with model B
The server returns http 500 error without an error message or error code. So the client is unable to know what happened. By checking the log, I can see
E:\_work\1\s\onnxruntime\core\providers\cuda\cuda_call.cc:129 onnxruntime::CudaCall E:\_work\1\s\onnxruntime\core\providers\cuda\cuda_call.cc:121 onnxruntime::CudaCall CUDA failure 2: out of memory ; GPU=0 ; hostname=ALEX-P14S ; file=E:\_work\1\s\onnxruntime\core\providers\cuda\cuda_execution_provider.cc ; line=287 ; expr=cudaDeviceSynchronize();
Expected:
- The 500 error body should contain error code or message indicating out of memory error.
- The model could be automatically unloaded if not used for sometime like Ollama.