No-copy Tensor transfer in python backend-based ensemble

nrakltx opened this issue 2 years ago

I am trying to build a pipelined inference server based mainly on the python backend (it runs PyTorch models, sometimes directly in the code itself). Originally I had the entire pipeline run in one big python function and served that as a single model. I now want to optimize performance, so I am attempting to create an ensemble made up of building blocks, each running a different part of the pipeline. My hope is that Triton's dynamic batching and optimized resource allocation and utilization will increase throughput (with an acceptable increase in latency).

For the ensemble to improve performance, the transfer of tensors from model to model must not involve copying. My tensors all live on the GPU and are too big to be copied DtoD all at once. In this issue it is stated that Triton still copies the tensors, even if not to the CPU and back. As stated above, that is not viable performance-wise. The only way I currently know of to avoid the copies is to manage the tensor lifecycle on my own (suggested here), but that is not practical and pretty bad practice IMO.
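For reference, here is a minimal sketch of the kind of per-stage model.py I have in mind (the tensor names and the relu stage are placeholders for my real pipeline code); inside the model everything stays on the GPU via DLPack, but whether Triton copies the tensors when handing them between the ensemble steps is exactly what I am asking about:

```python
import torch
from torch.utils.dlpack import from_dlpack, to_dlpack

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            in_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            # Wrap the Triton tensor as a torch GPU tensor through DLPack;
            # the wrap itself does not round-trip through the host.
            x = from_dlpack(in_tensor.to_dlpack())
            # Placeholder for one stage of the real pipeline.
            y = torch.relu(x)
            # Hand the result back to Triton, again via DLPack.
            out_tensor = pb_utils.Tensor.from_dlpack("OUTPUT0", to_dlpack(y))
            responses.append(
                pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses
```

(As far as I can tell, the model config also needs the FORCE_CPU_ONLY_INPUT_TENSORS parameter set to "no" so that the python backend does not move input tensors to the CPU.)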

Is there a Triton way to avoid copying between models that use the same (python) backend? Is there a way to avoid copying between models with different backends, say onnxruntime and python?

Thanks in advance!

nrakltx · Aug 24 '22 14:08

Hi @nrakltx ,

As of today, I suspect you might find better performance from the single model as opposed to the ensemble model due to some of the communication overhead issues you've pointed out.

For ways to get around those issues and avoid copies, @tanmayv25, @GuanLuo, or @Tabrizian may have some thoughts.

rmccorm4 · Aug 24 '22 18:08

Thank you! I'll be waiting for their advice.

nrakltx · Aug 24 '22 19:08

@nrakltx The design suggestion we can offer depends on what you are optimizing for. Is it per-request latency? If so, then using a single model that performs all the computation makes sense. To increase throughput, you can simply increase your model's instance count and reap the benefit of concurrent executions. Note that this might hurt per-request latency, but it will increase throughput provided you have sufficient request availability.
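For illustration, a rough client-side sketch (the model name, input name, and shape are placeholders) of keeping several requests in flight so that multiple instances can actually execute concurrently; the instance count itself is set through the instance_group field of the model configuration:

```python
import numpy as np
import tritonclient.http as httpclient

# concurrency > 1 lets the client keep several requests outstanding at once.
client = httpclient.InferenceServerClient(url="localhost:8000", concurrency=8)

data = np.random.rand(1, 3, 224, 224).astype(np.float32)
inp = httpclient.InferInput("INPUT0", list(data.shape), "FP32")
inp.set_data_from_numpy(data)

# Issue a batch of async requests before waiting on any result, so the server
# always has work queued for every model instance.
handles = [client.async_infer("my_model", inputs=[inp]) for _ in range(32)]
results = [h.get_result() for h in handles]
```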

Now, to the point about when an ensemble can give you performance benefits. In the single-model case above, all parts of the model run together, contending for the same resources. Here, moving from a data-parallel approach to the pipelined parallelism offered by an ensemble can be better: it gives you more control over how resources are allocated to different parts of your algorithm. If you already know where your bottlenecks are, you can break the model down and scale up those sections. Doing this will definitely hurt your per-request latency but may offer better throughput, and, as you said, you will incur some additional memory copies as well. In fact, for dynamic batching to work, there is no way around the memory copies.
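As a toy illustration outside of Triton (plain PyTorch, placeholder shapes): per-request tensors live in separate allocations, so gathering them into one contiguous batch necessarily materializes new memory:

```python
import torch

# Four "requests", each with its own GPU allocation.
requests = [torch.randn(1, 1024, device="cuda") for _ in range(4)]

# Forming one contiguous batch for the model is a device-to-device copy.
batch = torch.cat(requests, dim=0)

# The batch does not alias the first request's memory.
print(batch.data_ptr() == requests[0].data_ptr())  # False
```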

However, with enough request concurrency in the pipeline, these overheads should overlap with the execution of other requests. Hence, the data copies for one request can ideally coincide with GPU execution for another. The best-performing architecture depends on the nature of the algorithm inside your model and how you choose to partition it.

tanmayv25 · Oct 22 '22 00:10

Closing issue due to lack of activity. If you need further support, please let us know and we can reopen the issue.

dyastremsky · Nov 29 '22 17:11