Ensemble model inference time is way slower than the slowest individual model

zhaohb opened this issue 2 years ago · 3 comments

Description: I am hitting the same problem as https://github.com/triton-inference-server/server/issues/3245.

Triton Information: 22.05

Are you using the Triton container or did you build it yourself? Triton container

Here are my results; every model's instance count is 1: [screenshot of perf_client results]

command: ./perf_client -a -b 256 -u localhost:8001 -i gRPC -m model_name --shared-memory cuda --concurrency-range 4

The results show that the feature column model is the bottleneck. In theory, the ensemble's overall throughput should be close to that of the feature column model, but in practice the ensemble performs worse than any of its component models.
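For reference, a two-stage pipeline like this is typically wired up in the ensemble's config.pbtxt roughly as follows; the model and tensor names below are hypothetical placeholders, not taken from the actual setup:

```
name: "ensemble_model"
platform: "ensemble"
max_batch_size: 256
input [
  { name: "RAW_INPUT", data_type: TYPE_FP32, dims: [ -1 ] }
]
output [
  { name: "PREDICTION", data_type: TYPE_FP32, dims: [ -1 ] }
]
ensemble_scheduling {
  step [
    {
      # Stage 1: the feature column model (the measured bottleneck).
      # In each map: key = tensor name inside the composing model,
      # value = tensor name in the ensemble's scope.
      model_name: "feature_column"
      model_version: -1
      input_map { key: "INPUT" value: "RAW_INPUT" }
      output_map { key: "OUTPUT" value: "features" }
    },
    {
      # Stage 2: consumes stage 1's output via the internal "features" tensor
      model_name: "predictor"
      model_version: -1
      input_map { key: "INPUT" value: "features" }
      output_map { key: "OUTPUT" value: "PREDICTION" }
    }
  ]
}
```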

I set the instance count for the feature column model to 2, but it had no impact on performance. Is that because GPU utilization is already high, so adding instances doesn't help?
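For context, the instance count is set per model via instance_group in that model's config.pbtxt; a minimal sketch, with the model and GPU placement assumed:

```
# In the feature column model's config.pbtxt (model name assumed)
instance_group [
  {
    count: 2        # two execution instances of this model
    kind: KIND_GPU  # run them on the GPU
  }
]
```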

So, how should I optimize this? Thank you very much.

zhaohb avatar Jul 07 '22 02:07 zhaohb

Yes. If running just one of your models already uses 70-79% of your GPU, running multiple models at the same time consumes even more of it (keeping both models loaded and moving data between them), which leaves less headroom and makes each model run slower. In that case, adding instances won't do much, since you are still constrained by your hardware. Adding instances makes more sense when you have excess GPU memory and want to run another instance of the model on the same GPU.

We have general optimization options within Triton documented here. It looks like you're already using shared memory, which is good. You can also experiment with different configurations in Model Analyzer or Perf Analyzer, including different batch sizes and concurrency values; see the sketch below. Beyond that, it really comes down to the compute needs of your models versus your hardware.
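A concurrency/batch sweep with Perf Analyzer might look like this; the model name and ranges are placeholders, and in the 22.05 container the binary may still be invoked as perf_client:

```
# Sweep request concurrency 1..8 (step 1) at a fixed batch size
perf_analyzer -a -i grpc -u localhost:8001 \
    -m ensemble_model -b 64 \
    --shared-memory cuda \
    --concurrency-range 1:8:1

# Or let Model Analyzer search configurations automatically
model-analyzer profile --model-repository /models --profile-models ensemble_model
```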

dyastremsky avatar Jul 08 '22 15:07 dyastremsky

Thanks for your reply, but why is the ensemble's performance worse than either model on its own? An ensemble builds a pipeline, so its throughput should roughly match the slower of the two models, yet the ensemble is 20% worse than the worst model. Why is that?

Thank you very much.

zhaohb avatar Jul 11 '22 01:07 zhaohb

Though you didn't list the GPU utilization for the ensemble model, my suspicion is still the GPU, as mentioned. If running either model alone gets you close to full GPU utilization, running both models together in an ensemble likely needs more GPU than you have, making the GPU the bottleneck.

Can you please share the perf analyzer output for each model in the ensemble, as well as for the ensemble as a whole? That will give us more information to work with.
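One way to collect those numbers, assuming the composing models are named feature_column and predictor (placeholders), is to run the same measurement once per model and once for the ensemble:

```
# Same settings as the original command, repeated for each model,
# saving each report for side-by-side comparison
for model in feature_column predictor ensemble_model; do
    echo "=== ${model} ==="
    perf_analyzer -a -i grpc -u localhost:8001 \
        -m "${model}" -b 256 \
        --shared-memory cuda \
        --concurrency-range 4 | tee "perf_${model}.txt"
done
```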

dyastremsky avatar Jul 11 '22 12:07 dyastremsky

Closing this issue due to inactivity. Please re-open it if you would like to follow up.

krishung5 avatar Sep 07 '22 21:09 krishung5