[Feature] Triton server
Motivation
Support model serving
Modification
- Add Triton custom backend
- Add demo
Codecov Report
Patch and project coverage have no change.
Comparison is base (8e658cd) 49.67% compared to head (fcdf52f) 49.67%.
Additional details and impacted files
```
@@           Coverage Diff           @@
##             main    #2088   +/-   ##
=======================================
  Coverage   49.67%   49.67%
=======================================
  Files         339      339
  Lines       12998    12998
  Branches     1906     1906
=======================================
  Hits         6457     6457
  Misses       6090     6090
  Partials      451      451
```
| Flag | Coverage Δ | |
|---|---|---|
| unittests | 49.67% <ø> (ø) | |
You can temporarily use this Docker image for testing:

```
docker pull irexyc/mmmdeploy:triton-22.12
```
Hey, thanks for this. I wanted to know how to correctly send multiple bboxes for keypoint-detection inference.
I created a dict for each bbox and added it to the value list, but the results are not accurate, although the number of keypoints returned matches the number of bboxes.
```python
bbox_list = [{'bbox': bbox} for bbox in bboxes.tolist()]
bbox = {
    'type': 'PoseBbox',
    'value': bbox_list
}
```
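For context, a request with this structure can be sent with the Python tritonclient along these lines. This is only a rough sketch: the tensor names (`ori_img`, `bbox`), the model name, and the output name are placeholders, not necessarily what the demo's config.pbtxt actually declares.

```python
# Sketch of sending an image plus the PoseBbox JSON to Triton.
# Tensor, model, and output names below are placeholders -- check the
# model's config.pbtxt for the real ones.
import json

import cv2
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

img = cv2.imread("demo.jpg")  # HWC, BGR, uint8
img_input = httpclient.InferInput("ori_img", list(img.shape), "UINT8")
img_input.set_data_from_numpy(img)

# bboxes as in the snippet above
bbox = {"type": "PoseBbox", "value": [{"bbox": b} for b in bboxes.tolist()]}
bbox_bytes = np.array([json.dumps(bbox).encode("utf-8")], dtype=np.object_)
bbox_input = httpclient.InferInput("bbox", list(bbox_bytes.shape), "BYTES")
bbox_input.set_data_from_numpy(bbox_bytes)

result = client.infer("keypoint_detection", inputs=[img_input, bbox_input])
keypoints = result.as_numpy("keypoints")  # placeholder output name
```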
Also, what does "only support batch dim 1 for single request" mean? Does it mean the Triton version does not support batch inference?
@Y-T-G
> I created a dict for each bbox and added it to the value list, but the results are not accurate, although the number of keypoints returned matches the number of bboxes.

Could you show the visualized result with the bboxes? Does the inference result with a single bbox look right?

> Also, what does "only support batch dim 1 for single request" mean? Does it mean the Triton version does not support batch inference?
For batch inference with mmdeploy, you can refer to https://github.com/open-mmlab/mmdeploy/issues/839#issuecomment-1206029364.

Triton server supports both the dynamic batcher and the sequence batcher, but the mmdeploy backend only supports the dynamic batcher. You can add these lines to config.pbtxt:
```
dynamic_batching {
  max_queue_delay_microseconds: 100
}
```
With allow_ragged_batch and dynamic_batching, the mmdeploy backend can receive a batch of requests for each inference step. Therefore, you don't have to construct a normal batched input like b x c x h x w; you only need to send c x h x w to the Triton server and let it collect the batch of requests.
In summary, to use the mmdeploy Triton backend with batch inference, you have to:

- convert the model with batch inference support and edit the pipeline.json
- add dynamic_batching to config.pbtxt (see the config sketch below)
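A rough sketch of the relevant config.pbtxt pieces might look like this. The tensor name, dims, and max_batch_size here are illustrative placeholders, not the actual values used by the demo:

```
# Sketch only: name, dims, and max_batch_size are placeholders.
max_batch_size: 32

input [
  {
    name: "ori_img"
    data_type: TYPE_UINT8
    dims: [ -1, -1, 3 ]
    # let Triton group requests without padding them to a common shape
    allow_ragged_batch: true
  }
]

dynamic_batching {
  max_queue_delay_microseconds: 100
}
```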
@irexyc
> Could you show the visualized result with the bboxes? Does the inference result with a single bbox look right?
Yes, with a single bbox the inference is correct. But if I add more than one bbox, the outputs don't make any sense.
I only visualize the nose, left wrist, and right wrist keypoints. This is from RTMPose.
This is how it looks when I add more than one:
This is how it looks when I crop each bbox and send each one for inference separately:
The input for multiple bboxes looks like this:
```
{
  "type": "PoseBbox",
  "value": [
    { "bbox": [866, 47, 896, 101] },
    { "bbox": [48, 65, 73, 125] },
    { "bbox": [425, 32, 447, 97] },
    ....
  ]
}
```
> For batch inference with mmdeploy, you can refer to #839 (comment)
>
> Triton server supports both the dynamic batcher and the sequence batcher, but the mmdeploy backend only supports the dynamic batcher. You can add these lines to config.pbtxt:
>
> `dynamic_batching { max_queue_delay_microseconds: 100 }`
>
> With allow_ragged_batch and dynamic_batching, the mmdeploy backend can receive a batch of requests for each inference step. Therefore, you don't have to construct a normal batched input like b x c x h x w; you only need to send c x h x w to the Triton server and let it collect the batch of requests.
>
> In summary, to use the mmdeploy Triton backend with batch inference, you have to:
>
> - convert the model with batch inference support and edit the pipeline.json
> - add dynamic_batching to config.pbtxt
I am not sure this works. I don't see any improvement when I do this, after checking with perf_analyzer for ResNet18:
```
Inferences/Second vs. Client Average Batch Latency
Concurrency: 2, throughput: 41.6086 infer/sec, latency 47983 usec
Concurrency: 3, throughput: 41.8305 infer/sec, latency 71626 usec
Concurrency: 4, throughput: 41.775 infer/sec, latency 95672 usec
Concurrency: 5, throughput: 41.2752 infer/sec, latency 120931 usec
Concurrency: 6, throughput: 41.7747 infer/sec, latency 143440 usec
Concurrency: 7, throughput: 41.7748 infer/sec, latency 167467 usec
Concurrency: 8, throughput: 41.6641 infer/sec, latency 191807 usec
```
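(The numbers above come from a perf_analyzer sweep roughly like the one below; the exact model name registered in the Triton model repository is a placeholder.)

```bash
# placeholder model name; concurrency range matches the table above
perf_analyzer -m resnet18 --concurrency-range 2:8
```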
Batching is supported in the pipeline.json.
I see better improvement by launching multiple model instances using:
```
instance_group [
  {
    count: 4
    kind: KIND_GPU
  }
]
```
```
Inferences/Second vs. Client Average Batch Latency
Concurrency: 2, throughput: 62.7163 infer/sec, latency 31838 usec
Concurrency: 3, throughput: 65.1612 infer/sec, latency 46015 usec
Concurrency: 4, throughput: 79.328 infer/sec, latency 50415 usec
Concurrency: 5, throughput: 84.3826 infer/sec, latency 59160 usec
Concurrency: 6, throughput: 90.2152 infer/sec, latency 66516 usec
Concurrency: 7, throughput: 89.4926 infer/sec, latency 78322 usec
Concurrency: 8, throughput: 88.104 infer/sec, latency 90731 usec
```
I think the dynamic batcher depends on sequence_batching. But since each request is handled separately in instance_state.cpp, dynamic_batching will not have any effect. To have an effect, the requests would have to be batched and then inferred all at once.
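Schematically, the difference I mean is something like this (a Python sketch only, not the actual instance_state.cpp logic):

```python
# Schematic contrast only -- not mmdeploy's actual backend code.
import numpy as np

def execute_per_request(requests, model):
    # Current behaviour as I understand it: each request in the dynamically
    # batched group is still inferred on its own, so N requests = N forward passes.
    return [model(req["img"][None, ...]) for req in requests]

def execute_batched(requests, model):
    # What would be needed for dynamic_batching to pay off: stack the group
    # (assuming equal shapes after preprocessing) into one b x c x h x w
    # tensor and run a single forward pass.
    batch = np.stack([req["img"] for req in requests])
    outputs = model(batch)
    return [outputs[i] for i in range(len(requests))]
```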