[Feature] Triton server
Motivation
Support model serving
Modification
- Add Triton custom backend
- Add demo
Codecov Report
Patch and project coverage have no change.
Comparison is base (8e658cd) 49.67% compared to head (fcdf52f) 49.67%.
Additional details and impacted files
```
@@           Coverage Diff           @@
##             main    #2088   +/-   ##
=======================================
  Coverage   49.67%   49.67%
=======================================
  Files         339      339
  Lines       12998    12998
  Branches     1906     1906
=======================================
  Hits         6457     6457
  Misses       6090     6090
  Partials      451      451
```
| Flag | Coverage Δ | |
|---|---|---|
| unittests | 49.67% <ø> (ø) | |
You can temporarily use this Docker image for testing:

```
docker pull irexyc/mmmdeploy:triton-22.12
```
Hey, thanks for this. I wanted to know how to correctly send multiple bboxes for keypoint-detection inference.
I created a dict for each bbox and added it to the value list, but the results are not accurate, although the number of keypoints returned matches the number of bboxes.
```python
bbox_list = [{'bbox': bbox} for bbox in bboxes.tolist()]
bbox = {
    'type': 'PoseBbox',
    'value': bbox_list
}
```
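For context, a request with this structure can be sent with the Python tritonclient along these lines. This is only a rough sketch: the tensor names (`ori_img`, `bbox`), the model name, and the output name are placeholders, not necessarily what the demo's config.pbtxt actually declares.

```python
# Sketch of sending an image plus the PoseBbox JSON to Triton.
# Tensor, model, and output names below are placeholders -- check the
# model's config.pbtxt for the real ones.
import json

import cv2
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

img = cv2.imread("demo.jpg")  # HWC, BGR, uint8
img_input = httpclient.InferInput("ori_img", list(img.shape), "UINT8")
img_input.set_data_from_numpy(img)

# bboxes as in the snippet above
bbox = {"type": "PoseBbox", "value": [{"bbox": b} for b in bboxes.tolist()]}
bbox_bytes = np.array([json.dumps(bbox).encode("utf-8")], dtype=np.object_)
bbox_input = httpclient.InferInput("bbox", list(bbox_bytes.shape), "BYTES")
bbox_input.set_data_from_numpy(bbox_bytes)

result = client.infer("keypoint_detection", inputs=[img_input, bbox_input])
keypoints = result.as_numpy("keypoints")  # placeholder output name
```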
Also, what does "only support batch dim 1 for single request" mean? Does it mean the Triton version does not support batch inference?
@Y-T-G
> I created a dict for each bbox and added it to the value list, but the results are not accurate, although the number of keypoints returned matches the number of bboxes.

Could you show the visualized result with the bboxes? Does the inference result with a single bbox look right?

> Also, what does "only support batch dim 1 for single request" mean? Does it mean the Triton version does not support batch inference?
For batch inference with mmdeploy, you can refer to https://github.com/open-mmlab/mmdeploy/issues/839#issuecomment-1206029364.

Triton server supports both the dynamic batcher and the sequence batcher, but the mmdeploy backend only supports the dynamic batcher. You can add these lines to config.pbtxt:
```
dynamic_batching {
  max_queue_delay_microseconds: 100
}
```
With allow_ragged_batch and dynamic_batching, the mmdeploy backend can receive a batch of requests for each inference step. Therefore, you don't have to construct a normal batched input like b x c x h x w; you only need to send c x h x w to the Triton server and let it collect the batch of requests.
In summary, to use the mmdeploy Triton backend with batch inference, you have to:

- convert the model with batch inference support and edit the pipeline.json
- add dynamic_batching to config.pbtxt (see the config sketch below)
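A rough sketch of the relevant config.pbtxt pieces might look like this. The tensor name, dims, and max_batch_size here are illustrative placeholders, not the actual values used by the demo:

```
# Sketch only: name, dims, and max_batch_size are placeholders.
max_batch_size: 32

input [
  {
    name: "ori_img"
    data_type: TYPE_UINT8
    dims: [ -1, -1, 3 ]
    # let Triton group requests without padding them to a common shape
    allow_ragged_batch: true
  }
]

dynamic_batching {
  max_queue_delay_microseconds: 100
}
```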
@irexyc
> Could you show the visualized result with the bboxes? Does the inference result with a single bbox look right?
Yes, with a single bbox the inference is correct. But if I add more than one bbox, the outputs don't make any sense.
I only visualize the nose, left wrist, and right wrist keypoints. This is from RTMPose.
This is how it looks when I add more than one:
This is how it looks when I crop each bbox and send each one for inference separately:
The input for multiple bboxes looks like this:
```
{
  "type": "PoseBbox",
  "value": [
    { "bbox": [866, 47, 896, 101] },
    { "bbox": [48, 65, 73, 125] },
    { "bbox": [425, 32, 447, 97] },
    ....
  ]
}
```
> For batch inference with mmdeploy, you can refer to #839 (comment)
>
> Triton server supports both the dynamic batcher and the sequence batcher, but the mmdeploy backend only supports the dynamic batcher. You can add these lines to config.pbtxt:
>
> `dynamic_batching { max_queue_delay_microseconds: 100 }`
>
> With allow_ragged_batch and dynamic_batching, the mmdeploy backend can receive a batch of requests for each inference step. Therefore, you don't have to construct a normal batched input like b x c x h x w; you only need to send c x h x w to the Triton server and let it collect the batch of requests.
>
> In summary, to use the mmdeploy Triton backend with batch inference, you have to:
>
> - convert the model with batch inference support and edit the pipeline.json
> - add dynamic_batching to config.pbtxt
I am not sure this works. I don't see any improvement when I do this, after checking with perf_analyzer for ResNet18:
```
Inferences/Second vs. Client Average Batch Latency
Concurrency: 2, throughput: 41.6086 infer/sec, latency 47983 usec
Concurrency: 3, throughput: 41.8305 infer/sec, latency 71626 usec
Concurrency: 4, throughput: 41.775 infer/sec, latency 95672 usec
Concurrency: 5, throughput: 41.2752 infer/sec, latency 120931 usec
Concurrency: 6, throughput: 41.7747 infer/sec, latency 143440 usec
Concurrency: 7, throughput: 41.7748 infer/sec, latency 167467 usec
Concurrency: 8, throughput: 41.6641 infer/sec, latency 191807 usec
```
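(The numbers above come from a perf_analyzer sweep roughly like the one below; the exact model name registered in the Triton model repository is a placeholder.)

```bash
# placeholder model name; concurrency range matches the table above
perf_analyzer -m resnet18 --concurrency-range 2:8
```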
Batching is supported in the pipeline.json.
I see better improvement by launching multiple model instances using:
```
instance_group [
  {
    count: 4
    kind: KIND_GPU
  }
]
```
```
Inferences/Second vs. Client Average Batch Latency
Concurrency: 2, throughput: 62.7163 infer/sec, latency 31838 usec
Concurrency: 3, throughput: 65.1612 infer/sec, latency 46015 usec
Concurrency: 4, throughput: 79.328 infer/sec, latency 50415 usec
Concurrency: 5, throughput: 84.3826 infer/sec, latency 59160 usec
Concurrency: 6, throughput: 90.2152 infer/sec, latency 66516 usec
Concurrency: 7, throughput: 89.4926 infer/sec, latency 78322 usec
Concurrency: 8, throughput: 88.104 infer/sec, latency 90731 usec
```
I think the dynamic batcher depends on sequence_batching. But since each request is handled separately in instance_state.cpp, dynamic_batching will not have any effect. To have an effect, the requests would have to be batched and then inferred all at once.
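Schematically, the difference I mean is something like this (a Python sketch only, not the actual instance_state.cpp logic):

```python
# Schematic contrast only -- not mmdeploy's actual backend code.
import numpy as np

def execute_per_request(requests, model):
    # Current behaviour as I understand it: each request in the dynamically
    # batched group is still inferred on its own, so N requests = N forward passes.
    return [model(req["img"][None, ...]) for req in requests]

def execute_batched(requests, model):
    # What would be needed for dynamic_batching to pay off: stack the group
    # (assuming equal shapes after preprocessing) into one b x c x h x w
    # tensor and run a single forward pass.
    batch = np.stack([req["img"] for req in requests])
    outputs = model(batch)
    return [outputs[i] for i in range(len(requests))]
```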