Inference with an ensemble model is much slower than the slowest individual model
Description: I am hitting the same problem described in https://github.com/triton-inference-server/server/issues/3245.
Triton Information: 22.05
Are you using the Triton container or did you build it yourself? Triton container
Here are my results; every model's instance count is 1:
command: ./perf_client -a -b 256 -u localhost:8001 -i gRPC -m model_name --shared-memory cuda --concurrency-range 4
The results show that the feature column model is the bottleneck. In theory, the overall throughput of the ensemble should be close to that of the feature column model, but in practice the ensemble performs worse than any of the individual models.
I set the instance count for the feature column model to 2, but it had no impact on performance. Is that because GPU utilization is already high, so adding instances doesn't help?
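For reference, I changed the instance count through instance_group in the model's config.pbtxt. A minimal sketch of that change (the repository path and model directory name here are placeholders for my own layout):

```sh
# Hypothetical model-repository path; adjust to your layout.
# Appends a second GPU execution instance for the bottleneck model
# (assumes config.pbtxt has no existing instance_group block).
cat >> model_repository/feature_column/config.pbtxt <<'EOF'
instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]
EOF
```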
So, how should I optimize? Thank you very much.
Yes. If running just one of your models already hits 70-79% GPU utilization, running multiple models at the same time consumes even more of the GPU (keeping both loaded and moving data between them), leaving less headroom and making each of them run slower. In that case, adding instances won't do much, since you are still constrained by your hardware. Adding instances makes more sense when you have excess GPU memory and want to run another instance of the model on the same GPU.
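One way to confirm this, independent of Triton, is to watch GPU utilization and memory with standard nvidia-smi while the benchmark is running:

```sh
# Poll GPU utilization and memory once per second during the benchmark;
# sustained near-100% utilization means extra instances won't help.
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total \
           --format=csv -l 1
```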
Triton's general optimization options are covered in the optimization documentation. It looks like you're already using shared memory, so that's good. You can also experiment with different configurations via Model Analyzer or Perf Analyzer, including different batch sizes and concurrency values; see the sketch below. Beyond that, it really comes down to the compute needs of your models versus your hardware.
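As a starting point, here is a sketch of such a sweep using Perf Analyzer (perf_client is the older name for the same tool); the model name and endpoint below are placeholders:

```sh
# Sweep batch size and request concurrency against one model;
# repeat for each member model and for the ensemble itself.
for b in 64 128 256; do
  ./perf_analyzer -a -i gRPC -u localhost:8001 \
      -m ensemble_model -b "$b" \
      --shared-memory cuda \
      --concurrency-range 1:8:1
done
```

Comparing the member models' curves against the ensemble's curve should show where the extra latency comes from.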
Thanks for your reply, but why is the ensemble's performance worse than either model on its own? An ensemble should form a pipeline, so its performance should roughly match the worse of the two models, yet the ensemble is 20% worse than the worst model. Why is that?
Thank you very much.
Though you didn't list the GPU utilization for the ensemble model, my suspicion would still be the GPU, as mentioned. If running either model alone gets you close to full GPU utilization, running both models together in an ensemble may simply need more GPU than you have for the GPU not to become the bottleneck.
Can you please share the Perf Analyzer output for each model in the ensemble, as well as for the ensemble as a whole? That will give us more information.
Closing this issue due to inactivity. Please re-open it if you would like to follow up.