onnxruntime_backend
GPU memory leak with high load for ONNX model
Description
GPU memory leaks under high load: GPU memory usage climbs during heavy request traffic and is never released after the high-load requests stop.
Triton Information
What version of Triton are you using?
23.02
Are you using the Triton container or did you build it yourself?
Triton container
To Reproduce
Serve any ONNX model under high load; GPU memory usage increases monotonically.
Expected behavior
When requests stop coming, GPU memory should be released.
I wonder if the memory usage would come down if the model is unloaded (i.e. via the unload API).
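For reference, an unload like the one suggested above can be issued through Triton's model-repository HTTP API (the model name and host below are placeholders, and the server must be started with `--model-control-mode=explicit` for load/unload requests to be accepted):

```python
# Sketch: build and (optionally) send a Triton unload request.
# "my_onnx_model" and localhost:8000 are placeholder assumptions.
MODEL_NAME = "my_onnx_model"
unload_url = f"http://localhost:8000/v2/repository/models/{MODEL_NAME}/unload"

# import requests
# requests.post(unload_url)  # issues the unload; needs a running Triton server
print(unload_url)
```

If the leaked memory is held by the model instance itself, this request should return it; if usage stays flat after the unload completes, the memory is being held elsewhere in the backend.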
cc @tanmayv25 if the memory usage is expected.
Thx @kthui. I don't want to unload the model, since it is still used, just infrequently. Ideally, if a model sees no traffic for a prolonged period (say 2 hours), its GPU memory would be freed automatically.
@junwang-wish which execution provider in ORT are you using, TRT or CUDA? Does your model have dynamically shaped inputs? I am transferring the issue to the ORT backend team as it seems to be an issue with ORT.
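For anyone checking which execution provider is in effect: in the ONNX Runtime backend the GPU execution accelerator is selected in the model's config.pbtxt. An illustrative fragment selecting the TensorRT EP (omit it to fall back to the CUDA EP) looks like:

```
optimization {
  execution_accelerators {
    gpu_execution_accelerator : [ { name : "tensorrt" } ]
  }
}
```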