
[QUESTION] About Concurrent Model Execution Feature

Open alercelik opened this issue 2 years ago • 4 comments

Hello,

Concurrent model execution is listed as one of the core features of Triton Inference Server. Normally, without Triton, I load multiple models on a single GPU and run inference on all of them, and I was assuming the inferences run somewhat concurrently. Could you please tell me whether multiple models run concurrently on a single GPU without Triton Inference Server?

I could not find proper documentation describing the behaviour in this case on plain GPUs (without Triton Inference Server).

Regards

alercelik avatar Sep 16 '22 14:09 alercelik

Here's how I've been predicting with multiple models per GPU without Triton: loading data, then forking processes, each loading keras and a model on its own.

https://www.reddit.com/r/learnmachinelearning/comments/xg8ybf/optimizing_parallel_predictions_on_gpu/

It seems there should be a similarly lightweight way to transfer the input data to the GPU just once, rather than copying the same data over in parallel from each process.
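
For concreteness, here is a minimal sketch of the fork-per-model approach described above, assuming TensorFlow/Keras 2.x; the model paths, input shapes, and the 2 GB memory cap are illustrative rather than taken from the thread. Each worker process creates its own CUDA context and loads its own model, while the parent loads each model's data once and hands it to the matching worker.

```python
# Sketch only: model paths, input shapes, and the memory cap are illustrative.
import multiprocessing as mp

import numpy as np


def predict_worker(model_path, data, result_queue):
    # Import TensorFlow inside the worker so each process creates its own
    # CUDA context instead of inheriting a partially initialized one.
    import tensorflow as tf

    gpus = tf.config.list_physical_devices("GPU")
    if gpus:
        # Cap this process's share of GPU memory (2 GB here, illustrative)
        # so several processes can coexist on one device.
        tf.config.set_logical_device_configuration(
            gpus[0],
            [tf.config.LogicalDeviceConfiguration(memory_limit=2048)],
        )

    model = tf.keras.models.load_model(model_path)
    result_queue.put((model_path, model.predict(data)))


if __name__ == "__main__":
    # Each model gets its own input data, loaded once in the parent.
    inputs = {
        "model_a.keras": np.random.rand(32, 224, 224, 3).astype("float32"),
        "model_b.keras": np.random.rand(32, 224, 224, 3).astype("float32"),
    }

    # "spawn" avoids forking a process that has already touched CUDA.
    ctx = mp.get_context("spawn")
    results = ctx.Queue()
    procs = [
        ctx.Process(target=predict_worker, args=(path, data, results))
        for path, data in inputs.items()
    ]
    for p in procs:
        p.start()
    # Drain the queue before joining so large result arrays don't block the workers.
    outputs = dict(results.get() for _ in procs)
    for p in procs:
        p.join()
    print({name: preds.shape for name, preds in outputs.items()})
```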

phobrain avatar Sep 17 '22 07:09 phobrain

Thanks for the answer.

However, my models are independent of each other, and each model will work on different data.

alercelik avatar Sep 19 '22 05:09 alercelik

It demonstrates that you can run multiple models, each predicting on its own data stream, on a GPU without Triton, which answers your question. For example, here are several processes using GPU 0 (nvidia-smi output):

 0   N/A  N/A   1923934      C   /usr/bin/python3                  741MiB |
 0   N/A  N/A   1924494      C   /usr/bin/python3                  741MiB |
 0   N/A  N/A   1924604      C   /usr/bin/python3                  741MiB |
 0   N/A  N/A   1924650      C   /usr/bin/python3                  741MiB |
 0   N/A  N/A   1924716      C   /usr/bin/python3                  741MiB |
 0   N/A  N/A   1924738      C   /usr/bin/python3                  741MiB |
 0   N/A  N/A   1924760      C   /usr/bin/python3                  741MiB |
 0   N/A  N/A   1924782      C   /usr/bin/python3                  741MiB |

phobrain avatar Sep 19 '22 17:09 phobrain

You can also read about this in the architecture documentation

jbkyang-nvi avatar Sep 23 '22 02:09 jbkyang-nvi

Thanks, I did read the documentation beforehand. I was just trying to understand whether there is any higher-level logic beyond round-robin-like scheduling with dynamic batching. We ran some tests and did not see a performance increase when we implemented round-robin-like scheduling and dynamic batching ourselves and put our models behind them.
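
For reference, below is a minimal sketch of what a hand-rolled round-robin dispatcher with dynamic batching (as described above) might look like; this is not Triton's scheduler, and the batch size, queue delay, worker count, and `fake_model` stand-in are all illustrative.

```python
# Sketch only: batch size, queue delay, and fake_model are illustrative;
# this is not Triton's scheduler.
import itertools
import queue
import threading
import time
from concurrent.futures import ThreadPoolExecutor

import numpy as np


class DynamicBatcher:
    """Collects single requests and runs them as one batch, flushing either
    when max_batch_size requests have queued or max_delay_s has elapsed."""

    def __init__(self, predict_fn, max_batch_size=8, max_delay_s=0.005):
        self.predict_fn = predict_fn
        self.max_batch_size = max_batch_size
        self.max_delay_s = max_delay_s
        self.requests = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def submit(self, x):
        """Enqueue one input and block until its result is ready."""
        slot = {"x": x, "y": None, "done": threading.Event()}
        self.requests.put(slot)
        slot["done"].wait()
        return slot["y"]

    def _loop(self):
        while True:
            batch = [self.requests.get()]  # block until at least one request
            deadline = time.monotonic() + self.max_delay_s
            while len(batch) < self.max_batch_size:
                timeout = deadline - time.monotonic()
                if timeout <= 0:
                    break
                try:
                    batch.append(self.requests.get(timeout=timeout))
                except queue.Empty:
                    break
            xs = np.stack([slot["x"] for slot in batch])
            ys = self.predict_fn(xs)  # one batched call instead of N single calls
            for slot, y in zip(batch, ys):
                slot["y"] = y
                slot["done"].set()


def fake_model(xs):
    # Stand-in for model.predict(xs): one output per input row.
    return xs.sum(axis=1)


if __name__ == "__main__":
    # Round-robin over two batchers; in a real setup each could wrap a
    # different model instance, process, or CUDA stream.
    batchers = [DynamicBatcher(fake_model) for _ in range(2)]
    rr = itertools.cycle(batchers)

    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [
            pool.submit(next(rr).submit, np.ones(4, dtype=np.float32))
            for _ in range(10)
        ]
        print([float(f.result()) for f in futures])
```

Whether a layer like this helps depends on requests actually arriving close enough together to form batches larger than one; if each model is already fed full batches, the extra scheduling layer mostly adds queueing overhead, which would be consistent with not seeing a performance increase.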

alercelik avatar Sep 28 '22 06:09 alercelik