TorchServe Workflow Fails at Medium QPS
We have two ONNX models deployed on a GPU machine built on top of the nightly Docker image.
- The first model runs with zero failures at 500 QPS (p99 latency < 8 ms) during a 2-hour perf test.
- The second model runs with zero failures at 500 QPS (p99 latency < 11 ms) during a 2-hour perf test, with some improvement in p99 latency (< 9 ms) at a reduced QPS of 400.
- When I try a sequential workflow that starts with the first model and, in ~1% of the cases, triggers the second model, the machine becomes unresponsive after a few minutes at 100 QPS, causing the perf test to fail. A few hours later, I accidentally discovered that the machine had become responsive again (I don't know exactly when, though).
- Running this same workflow at only 20 QPS, the perf test succeeds over 24 hours (with only 52 failures).
I suspect there is a delay in releasing resources that becomes an issue only at high QPS (these resources are eventually released later, bringing the machine back to life).
@mossaab0 What version of TS are you using? Can you try building TS from source and let me know if it still fails? I suspect this is the same issue for which I pushed fix #1552 (the fix will be included in the next release).
@maaquib This is based on torchserve-nightly:gpu-2022.04.13, which already includes the #1552 fix. Before that fix, even 20 QPS was failing.
@mossaab0 If you can provide some reproduction steps, I can try to root-cause this.
@maaquib It is a bit difficult to provide more reproduction steps, as that would basically mean sharing the models. But here is something you could try (which I haven't tried myself, though): figure out the maximum QPS that a GPU node can handle for the cat/dog classifier for a couple of hours. Then run a perf test at half of that QPS using the sequential workflow (i.e., including the dog breeds model) for a couple of hours. I expect the second perf test to fail.
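For concreteness, the repro could use a workflow spec along these lines. This is only a hypothetical sketch: the archive names, node names, and the `prep_intermediate_input` function are illustrative placeholders, and the exact spec keys should be checked against the TorchServe workflow docs (the dog-breed workflow example in the repo has a similar layout).

```yaml
# Hypothetical workflow spec for the suggested repro (names are illustrative)
models:
    cat_dog_classification:
        url: cat_dog_classification.mar
    dog_breed_classification:
        url: dog_breed_classification.mar

dag:
    pre_processing: [cat_dog_classification]
    cat_dog_classification: [prep_intermediate_input]
    prep_intermediate_input: [dog_breed_classification]
```

Here `pre_processing` and `prep_intermediate_input` would be functions defined in the workflow handler file that gets packaged with torch-workflow-archiver.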
Hi @mossaab0, we've discussed this internally. We're in the process of redesigning how workflows work, to make it possible to define a DAG within your handler file in Python.
It should be possible to take an existing sequential or parallel workflow and refactor it into a new nn.Module or handler.py.
Please ping me if you need any advice on how to do this.
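As a very rough, hypothetical sketch of that kind of refactor (not the redesigned API): a single custom handler.py can own both onnxruntime sessions and do the routing in process. The file names, the raw-float32 input assumption, and the score-threshold routing rule below are placeholders standing in for the real models and the "~1% of cases" trigger described above.

```python
# handler.py -- hypothetical sketch; model file names, the raw-float32 input
# assumption, and the score-threshold routing rule are all placeholders.
import os

import numpy as np
import onnxruntime as ort
from ts.torch_handler.base_handler import BaseHandler


class SequentialOnnxHandler(BaseHandler):
    """Run the first model on every request and route a small fraction of
    requests through the second model, all inside one TorchServe worker."""

    def initialize(self, context):
        model_dir = context.system_properties.get("model_dir")
        providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
        # Both ONNX files are packaged into the same .mar archive.
        self.session_a = ort.InferenceSession(
            os.path.join(model_dir, "model_a.onnx"), providers=providers)
        self.session_b = ort.InferenceSession(
            os.path.join(model_dir, "model_b.onnx"), providers=providers)
        self.initialized = True

    def preprocess(self, data):
        # Assumes each request body is a raw float32 buffer of a fixed
        # shape; adapt to the real input format.
        rows = [np.frombuffer(row.get("data") or row.get("body"),
                              dtype=np.float32) for row in data]
        return np.stack(rows)

    def inference(self, batch):
        input_a = self.session_a.get_inputs()[0].name
        scores_a = self.session_a.run(None, {input_a: batch})[0]

        results = []
        for x, score in zip(batch, scores_a):
            # Placeholder routing rule standing in for the "~1% of cases"
            # condition that triggers the second model.
            if score.max() > 0.99:
                input_b = self.session_b.get_inputs()[0].name
                score = self.session_b.run(
                    None, {input_b: x[np.newaxis, ...]})[0][0]
            results.append(score)
        return results

    def postprocess(self, inference_output):
        # One JSON-serializable entry per request in the batch.
        return [r.tolist() for r in inference_output]
```

Both .onnx files can be bundled into one model archive via torch-model-archiver's --extra-files option, so the whole sequence runs inside a single worker process instead of hopping between workflow nodes.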
I'm also running into this. Any pointers to what the refactor would look like?