
[Question] Multiple model inputs and GPU allocations

Open msyulia opened this issue 1 year ago • 0 comments

Hi!

I wasn't sure whether to file this as a bug or whether it works as intended.

I'm currently facing an issue where a model deployed via the Triton ONNX backend, with up to a hundred inputs, has a relatively high nv_inference_compute_input_duration_us. As I understand it, this metric also includes the time spent copying tensor data to the GPU. Is it possible that each input results in a separate GPU allocator call?

From what I see in ModelInstanceState::SetInputTensors (https://github.com/triton-inference-server/onnxruntime_backend/blob/main/src/onnxruntime.cc#L2273), inputs are processed sequentially, and each input results in a call to CreateTensorWithDataAsOrtValue. Could this lead to separate GPU allocations and copies for every input, and therefore a long nv_inference_compute_input_duration_us? Or does copying tensor data to the GPU happen before a request is passed to the ONNX backend?
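To make the concern concrete, here is a minimal, self-contained C++ sketch (not the actual backend code; the `Input`, `copy_per_input`, and `copy_coalesced` names are hypothetical, and plain vector copies stand in for cudaMalloc/cudaMemcpy). It contrasts one device transfer per input, which is what a per-input CreateTensorWithDataAsOrtValue loop could amount to, with a single coalesced transfer of one packed staging buffer:

```cpp
#include <cassert>
#include <cstdint>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one request input's host-side tensor data.
struct Input { std::vector<uint8_t> host; };

// Per-input path: one allocation + one transfer per input. With ~100 inputs,
// this means ~100 separate device copies per request.
static std::vector<std::vector<uint8_t>> copy_per_input(
    const std::vector<Input>& inputs, size_t* num_transfers) {
  std::vector<std::vector<uint8_t>> device_buffers;
  for (const auto& in : inputs) {
    device_buffers.emplace_back(in.host);  // stand-in for cudaMalloc + cudaMemcpy
    ++(*num_transfers);
  }
  return device_buffers;
}

// Coalesced path: pack all inputs into one host staging buffer, then issue a
// single device transfer, regardless of the input count.
static std::vector<uint8_t> copy_coalesced(
    const std::vector<Input>& inputs, size_t* num_transfers) {
  size_t total = 0;
  for (const auto& in : inputs) total += in.host.size();
  std::vector<uint8_t> staging;
  staging.reserve(total);
  for (const auto& in : inputs)
    staging.insert(staging.end(), in.host.begin(), in.host.end());
  ++(*num_transfers);  // one transfer of the packed buffer
  return staging;
}
```

If the backend really does the per-input variant, the fixed per-transfer overhead (allocator call, copy launch) would be paid once per input rather than once per request, which could explain the elevated compute-input duration.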

msyulia · Aug 29 '24 10:08