onnxruntime_backend
Global GPU Memory Limit
Is there a way to control (limit) the global GPU memory usage of the onnxruntime backend in Triton? The TensorFlow backend has the following CLI option:
--backend-config tensorflow,gpu-memory-fraction=X
I wonder whether there is a similar way to control the total allocated GPU memory for the ONNX backend. Would the ONNX backend dynamically release unused memory? If so, is there a GPU memory allocation pattern one can expect when doing, for example, first an inference with model A and then with model B?
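For comparison, outside of Triton the standalone ONNX Runtime CUDA execution provider accepts a `gpu_mem_limit` provider option that caps its memory arena. Whether and how the Triton onnxruntime backend forwards such provider options is exactly what this question is asking, so the snippet below is only an illustrative sketch of the underlying ONNX Runtime API (the 2 GiB value and the commented-out session creation are placeholders; `onnxruntime-gpu` must be installed to actually create a session):

```python
# Sketch: provider options for ONNX Runtime's CUDA execution provider.
# gpu_mem_limit caps the BFC memory arena; arena_extend_strategy controls
# how the arena grows when more memory is requested.
GiB = 1024 ** 3

cuda_provider_options = {
    "gpu_mem_limit": str(2 * GiB),                # cap arena at 2 GiB (placeholder value)
    "arena_extend_strategy": "kSameAsRequested",  # grow only by what is requested
}

providers = [
    ("CUDAExecutionProvider", cuda_provider_options),
    "CPUExecutionProvider",  # fallback provider
]

# With onnxruntime-gpu installed, the options would be applied like this:
# import onnxruntime as ort
# session = ort.InferenceSession("model.onnx", providers=providers)
```

Note that `gpu_mem_limit` only bounds the arena allocator, not every CUDA allocation (e.g. cuDNN workspace memory), so it is not a hard global cap by itself.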