
High Queue Latency With BLS

Open SandraWang-SH opened this issue 1 year ago • 4 comments

Description of problem: We aim to use BLS in Triton Server. After adding BLS code to the model, we found that the model's latency increased significantly.

On Datadog, we can see that the latency of the ensemble is much lower than that of BLS. Their server resource allocation and traffic are, of course, the same.

Avg: Call watch tower with ensemble: 6.30 ms. Call watch tower with BLS: 32.04 ms.

[Image: Datadog latency comparison between the ensemble and BLS]

After further monitoring, we found that queue latency also increases as traffic increases.

[Image: queue latency increasing with traffic]

Do you know why BLS causes this queuing? The performance of BLS seems to be much worse than that of the ensemble. Any ideas about this?

SandraWang-SH avatar Sep 18 '24 09:09 SandraWang-SH

@Tabrizian Hi, sorry to bother you. Could you please take some time to look at this issue? Many thanks; looking forward to your reply.

SandraWang-SH avatar Sep 24 '24 05:09 SandraWang-SH

I think the issue might come from the fact that BLS does not benefit from ensemble scheduling.

Let's say you have a pipeline with three steps A, B, C and four requests (R1, R2, R3, R4) in the queue.

In the BLS case, R2 only starts to be processed once R1 has gone through all three steps A, B, C.

In the ensemble case, as soon as R1 moves from step A to step B, R2 starts to be processed in step A. As you can imagine, this is much more efficient, since the requests spend far less time in the queue.

I would advise using ensembles as much as you can, keeping in mind that they don't allow for control flow in between models. Below you will find a rough simulation of what would happen for a three-step pipeline.

On top of the ensemble scheduling, I have observed that using Python and torch in your BLS model adds significant overhead to the pipeline.

[Screenshots: simulation code and queue-latency plots for the three-step pipeline under BLS vs. ensemble scheduling]
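For reference, here is a minimal sketch of the kind of simulation shown in the screenshots. The numbers are illustrative assumptions, not Triton measurements: every step takes a fixed 10 ms, there is one instance per step, and all four requests arrive at t=0.

```python
# Rough simulation of queueing for a three-step pipeline (illustrative
# assumptions: every step takes STEP_MS, one instance per step, all
# requests arrive at t=0).
STEP_MS = 10
N_STEPS = 3
N_REQUESTS = 4

# BLS-style scheduling: request i+1 only starts after request i has
# finished all three steps, so completion time grows by
# N_STEPS * STEP_MS per queued request.
bls_done = [(i + 1) * N_STEPS * STEP_MS for i in range(N_REQUESTS)]

# Ensemble-style scheduling: the steps are pipelined, so once the
# pipeline is full, one request completes every STEP_MS.
ensemble_done = [(N_STEPS + i) * STEP_MS for i in range(N_REQUESTS)]

for i, (b, e) in enumerate(zip(bls_done, ensemble_done), start=1):
    print(f"R{i}: done at {b} ms with BLS, {e} ms with ensemble")
```

Under these assumptions R1 takes 30 ms either way, but R4 finishes at 120 ms with BLS versus 60 ms with the ensemble, and the gap grows linearly with queue depth, which matches the queue latency you see growing with traffic.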

MatthieuToulemont avatar Oct 09 '24 10:10 MatthieuToulemont

Hi @MatthieuToulemont, thanks for your reply. Your explanation makes sense. Has Triton considered optimizing BLS to achieve performance similar to ensembles?

In fact, we have three steps A, B, and C, and we want to add Cal Log (a logging platform that records transactions and events) to record the failure of each step. With BLS, appropriate logs can be printed when any step fails; with an ensemble, we only see an internal error and have no way of knowing which step failed. BLS is very well suited to this Cal Log scenario, but the increase in latency prevents us from using it.
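For context, the per-step logging we want looks roughly like this in a Python-backend BLS model. This is a minimal sketch: `run_step` and the Cal Log call site are hypothetical, while the `pb_utils` calls are from the Triton Python backend BLS API.

```python
import triton_python_backend_utils as pb_utils

def run_step(model_name, inputs, output_names):
    """Run one pipeline step via BLS, logging which step failed."""
    infer_request = pb_utils.InferenceRequest(
        model_name=model_name,
        requested_output_names=output_names,
        inputs=inputs)
    infer_response = infer_request.exec()
    if infer_response.has_error():
        # This is where a Cal Log transaction for the failing step would go.
        pb_utils.Logger.log_error(
            f"Step '{model_name}' failed: {infer_response.error().message()}")
        raise pb_utils.TritonModelException(
            infer_response.error().message())
    return infer_response
```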

XiaoxueWang1 avatar Oct 10 '24 07:10 XiaoxueWang1

Has Triton considered optimizing BLS to achieve similar performance to ensemble?

I have no idea (I don't work at NVIDIA :) )

If ensemble is used, there is an internal error and we have no way of knowing which step failed.

Depending on the verbosity you set for Triton, you should be able to tell in which step an error occurs. Which logging level do you use?
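For example, raising the server's verbose log level makes Triton print per-request scheduling and execution details for each step (the model repository path here is a placeholder):

```
tritonserver --model-repository=/models --log-verbose=1
```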

MatthieuToulemont avatar Oct 10 '24 12:10 MatthieuToulemont

What @MatthieuToulemont is showing above is true only for synchronous, coupled BLS. However, if you use decoupled BLS, one instance can process multiple concurrent requests. On top of that, if you would like your BLS instance to send multiple concurrent requests to the "children" models, use async execute.
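To illustrate the async execute pattern: with an `async def execute`, the BLS calls to the child models can be awaited concurrently instead of serially. This is a minimal sketch based on the Python backend's BLS API; the model names `model_a` / `model_b` and the tensor names are placeholders.

```python
import asyncio
import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    async def execute(self, requests):
        responses = []
        for request in requests:
            input_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT")
            # Build BLS requests to two hypothetical child models.
            req_a = pb_utils.InferenceRequest(
                model_name="model_a",
                requested_output_names=["OUTPUT"],
                inputs=[input_tensor])
            req_b = pb_utils.InferenceRequest(
                model_name="model_b",
                requested_output_names=["OUTPUT"],
                inputs=[input_tensor])
            # Issue both child inferences concurrently rather than
            # waiting for one to finish before starting the next.
            resp_a, resp_b = await asyncio.gather(
                req_a.async_exec(), req_b.async_exec())
            for resp in (resp_a, resp_b):
                if resp.has_error():
                    raise pb_utils.TritonModelException(
                        resp.error().message())
            out = pb_utils.get_output_tensor_by_name(resp_b, "OUTPUT")
            responses.append(
                pb_utils.InferenceResponse(output_tensors=[out]))
        return responses
```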

okdimok avatar Nov 29 '24 08:11 okdimok

Ok, I didn't know that; very interesting!

MatthieuToulemont avatar Nov 29 '24 13:11 MatthieuToulemont