TorchServe Workflow Fails at Medium QPS

Open · mossaab0 opened this issue on Apr 21 '22 · 6 comments

We have two ONNX models deployed on a GPU machine built on top of the nightly Docker image.

  • The first model runs with zero failures at 500 QPS (p99 latency < 8 ms) during a 2-hour perf test.
  • The second model also runs with zero failures at 500 QPS (p99 latency < 11 ms) during a 2-hour perf test; its p99 latency improves to under 9 ms at a reduced QPS of 400.
  • When I try a sequential workflow that starts with the first model and, in ~1% of cases, triggers the second model, the machine becomes unresponsive after a few minutes at 100 QPS, causing the perf test to fail. A few hours later, I accidentally discovered that the machine had become responsive again (I don't know exactly when, though).
  • Running the same workflow at only 20 QPS, the perf test succeeds for a duration of 24 hours (with only 52 failures).

I suspect there is a delay in releasing resources that becomes an issue only at high QPS (these resources are eventually released later, bringing the machine back to life).

mossaab0 · Apr 21 '22 16:04

@mossaab0 What version of TS are you using? Can you try building TS from source and let me know if it still fails? I suspect this is the same issue for which I pushed fix #1552 (it will be included in the next release).

maaquib · Apr 21 '22 18:04

@maaquib This is based on torchserve-nightly:gpu-2022.04.13, which already includes the #1552 fix. Before that fix, even 20 QPS was failing.

mossaab0 · Apr 21 '22 18:04

@mossaab0 If you can provide some reproduction steps, I can try to root-cause this.

maaquib · Apr 21 '22 21:04

@maaquib It is a bit difficult to provide more reproduction steps, as that would basically mean sharing the models. But here is something you could try (which I haven't tried myself): figure out the maximum QPS that a GPU node can handle for the cat/dog classifier (for a couple of hours). Then run a perf test at half that QPS using the sequential workflow (i.e., including the dog-breeds model) for a couple of hours. I expect the second perf test to fail.
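As a rough illustration, a constant-QPS driver for such a test could be sketched as below; the workflow name (`dog_breed_wf`), payload file, endpoint, rate, and duration are placeholders rather than the actual setup from this issue:

```python
# load_test.py -- rough sketch of a fixed-QPS driver against a registered workflow.
# All names and numbers are placeholders, not taken from this issue.
import time
import requests
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8080/wfpredict/dog_breed_wf"  # workflow inference endpoint
QPS = 100                  # e.g. half of the single-model ceiling found earlier
DURATION_S = 2 * 60 * 60   # "a couple of hours"

with open("kitten.jpg", "rb") as f:   # stand-in payload
    PAYLOAD = f.read()

def fire():
    try:
        return requests.post(URL, data=PAYLOAD, timeout=5).status_code
    except requests.RequestException:
        return -1

futures = []
with ThreadPoolExecutor(max_workers=64) as pool:
    start = time.time()
    while time.time() - start < DURATION_S:
        tick = time.time()
        # Fire a burst of QPS requests, then sleep out the rest of the second.
        futures.extend(pool.submit(fire) for _ in range(QPS))
        time.sleep(max(0.0, 1.0 - (time.time() - tick)))

failures = sum(1 for f in futures if f.result() != 200)
print(f"failed requests: {failures} / {len(futures)}")
```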

mossaab0 · Apr 21 '22 21:04

Hi @mossaab0, we've discussed this internally. We're in the process of redesigning how workflows work to make it possible to define a DAG within your handler file in Python.

It should be possible to take an existing sequential or parallel workflow and refactor it into a new nn.Module or handler.py. Please ping me if you need any advice on how to do this.
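A minimal sketch of what such a consolidated handler.py could look like, assuming both models are ONNX Runtime sessions and the second model is only invoked when the first model's score crosses a threshold (file names, tensor names, and the threshold are illustrative, not from this issue):

```python
# handler.py -- sketch of collapsing a two-stage workflow into one custom handler.
import json
import os

import numpy as np
import onnxruntime as ort
from ts.torch_handler.base_handler import BaseHandler


class TwoStageHandler(BaseHandler):
    """Runs the first ONNX model on every request and the second model only
    when the first model's top score crosses TRIGGER_THRESHOLD."""

    TRIGGER_THRESHOLD = 0.5  # hypothetical gate; ~1% of traffic in the reported setup

    def initialize(self, context):
        model_dir = context.system_properties.get("model_dir")
        providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
        # Load both sessions once per worker, so there is no inter-model hop at request time.
        self.first = ort.InferenceSession(
            os.path.join(model_dir, "model_one.onnx"), providers=providers)
        self.second = ort.InferenceSession(
            os.path.join(model_dir, "model_two.onnx"), providers=providers)
        self.initialized = True

    def preprocess(self, data):
        # Assume each request body is a JSON list of floats (illustrative).
        rows = [json.loads(d.get("body") or d.get("data")) for d in data]
        return np.asarray(rows, dtype=np.float32)

    def inference(self, batch, *args, **kwargs):
        scores = self.first.run(None, {"input": batch})[0]
        results = []
        for row, score in zip(batch, scores):
            if float(np.max(score)) >= self.TRIGGER_THRESHOLD:
                # Conditional second stage runs in-process instead of as a workflow node.
                refined = self.second.run(None, {"input": row[None, :]})[0]
                results.append(refined[0].tolist())
            else:
                results.append(score.tolist())
        return results

    def postprocess(self, inference_output):
        # One entry per request, as TorchServe expects.
        return inference_output
```

Both .onnx files could then be packaged into a single .mar (for example via torch-model-archiver's --serialized-file and --extra-files), so the conditional second-stage call happens inside one worker instead of as a separate workflow hop.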

msaroufim · May 09 '22 21:05

I'm also running into this. Any pointers to what the refactor would look like?

jonhilgart22 · Aug 02 '22 16:08