[QUESTION] About Concurrent Model Execution Feature
Hello,
Concurrent model execution is listed as one of the core features of Triton Inference Server. Normally, without Triton, I load multiple models on a single GPU and run inference on all of them, and I was assuming those inferences already happen somewhat concurrently. Could you please tell me whether multiple models run concurrently on a single GPU without Triton Inference Server?
I could not find proper documentation describing the behaviour for this case on plain GPUs (without Triton Inference Server).
Regards
Here's how I've been predicting with multiple models per GPU without Triton: load the data, then fork processes, with each process loading Keras and one model on its own (see the sketch after the link below).
https://www.reddit.com/r/learnmachinelearning/comments/xg8ybf/optimizing_parallel_predictions_on_gpu/
It seems there should be a similarly lightweight way to transfer the input data to the GPU just once, rather than transferring the same data in parallel from each process.
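Roughly, that workflow looks like the following minimal sketch. The model paths and input shape are placeholders, not taken from the thread, and it assumes TensorFlow/Keras models saved to disk:

```python
# Load the input data once in the parent, then fork one process per model;
# each child imports Keras, loads its own model, and predicts on the shared data.
import multiprocessing as mp

import numpy as np

MODEL_PATHS = ["model_a.keras", "model_b.keras"]  # hypothetical paths


def worker(model_path, data):
    # Import TF/Keras inside the child so each process creates its own CUDA context.
    import tensorflow as tf

    # Let TF grow GPU memory on demand so several processes can share one GPU.
    for gpu in tf.config.list_physical_devices("GPU"):
        tf.config.experimental.set_memory_growth(gpu, True)

    model = tf.keras.models.load_model(model_path)
    preds = model.predict(data, verbose=0)
    print(model_path, preds.shape)


if __name__ == "__main__":
    data = np.random.rand(64, 224, 224, 3).astype("float32")  # placeholder input
    ctx = mp.get_context("fork")  # fork matches the workflow above (Linux only)
    procs = [ctx.Process(target=worker, args=(path, data)) for path in MODEL_PATHS]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

Each child process ends up as a separate CUDA context on the same GPU, which is what shows up in the nvidia-smi listing further down.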
Thanks for the answer.
However, my models are independent of each other and each model works on different data.
It demonstrates that you can run multiple models, each predicting on its own data stream, on a GPU without Triton, which answers your question. Here are several processes using GPU 0, for example:
GPU   GI   CI        PID  Type  Process name       GPU Memory
  0  N/A  N/A   1923934     C   /usr/bin/python3       741MiB
  0  N/A  N/A   1924494     C   /usr/bin/python3       741MiB
  0  N/A  N/A   1924604     C   /usr/bin/python3       741MiB
  0  N/A  N/A   1924650     C   /usr/bin/python3       741MiB
  0  N/A  N/A   1924716     C   /usr/bin/python3       741MiB
  0  N/A  N/A   1924738     C   /usr/bin/python3       741MiB
  0  N/A  N/A   1924760     C   /usr/bin/python3       741MiB
  0  N/A  N/A   1924782     C   /usr/bin/python3       741MiB
You can also read about this in the architecture documentation.
Thanks, I did read the documentation beforehand. I was just trying to understand whether there is any higher-level logic beyond round-robin-like scheduling with dynamic batching. We conducted some tests and did not see an increase in performance when we implemented round-robin-like scheduling with dynamic batching ourselves and put our models behind it.
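For reference, here is a toy sketch of the kind of round-robin dispatch with dynamic batching described above. The queue, batch-size, and timeout parameters are assumptions for illustration, not the actual test code, and the model call is a stand-in:

```python
# Toy round-robin dispatcher with dynamic batching: requests are spread
# across per-model batchers in round-robin order; each batcher collects
# requests until it hits a batch size or a timeout, then runs one batch.
import itertools
import queue
import threading
import time

import numpy as np


class DynamicBatcher:
    """Collects requests for one model and runs them as a single batch."""

    def __init__(self, name, max_batch=8, max_delay_s=0.005):
        self.name = name
        self.requests = queue.Queue()
        self.max_batch = max_batch
        self.max_delay_s = max_delay_s
        threading.Thread(target=self._loop, daemon=True).start()

    def submit(self, x):
        self.requests.put(x)

    def _loop(self):
        while True:
            batch = [self.requests.get()]               # block for the first request
            deadline = time.time() + self.max_delay_s
            while len(batch) < self.max_batch:
                remaining = deadline - time.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(self.requests.get(timeout=remaining))
                except queue.Empty:
                    break
            x = np.stack(batch)
            _ = x.sum()                                 # stand-in for model.predict(x)
            print(f"{self.name}: ran batch of {len(batch)}")


if __name__ == "__main__":
    batchers = [DynamicBatcher(f"model_{i}") for i in range(3)]
    rr = itertools.cycle(batchers)                      # round-robin over models
    for _ in range(30):
        next(rr).submit(np.random.rand(224, 224, 3).astype("float32"))
        time.sleep(0.001)
    time.sleep(0.1)                                     # let the background threads drain (toy example)
```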