
[Feature] Triton server

Open irexyc opened this issue 2 years ago • 7 comments

Motivation

Support model serving

Modification

- Add triton custom backend
- Add demo

irexyc · May 18 '23 06:05

Codecov Report

Patch and project coverage are unchanged.

Comparison: base (8e658cd) 49.67% vs. head (fcdf52f) 49.67%.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2088   +/-   ##
=======================================
  Coverage   49.67%   49.67%           
=======================================
  Files         339      339           
  Lines       12998    12998           
  Branches     1906     1906           
=======================================
  Hits         6457     6457           
  Misses       6090     6090           
  Partials      451      451           
Flag Coverage Δ
unittests 49.67% <ø> (ø)

Flags with carried forward coverage won't be shown.


codecov[bot] · May 18 '23 06:05

You can temporarily use this Docker image for testing:

docker pull irexyc/mmmdeploy:triton-22.12

irexyc · May 23 '23 06:05

Hey, thanks for this. I wanted to know how to correctly send multiple bboxes for keypoint-detection inference.

I created a dict for each bbox here and added it to the value list, and used that, but the results are not accurate, although the number of keypoints returned matches the number of bboxes.

bbox_list = [{'bbox':bbox} for bbox in bboxes.tolist()]
bbox = {
    'type': 'PoseBbox',
    'value': bbox_list
}
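
For reference, a minimal sketch of how such a PoseBbox payload could be passed to the server with tritonclient. This is not the demo client: the model name, the input tensor names, and the JSON-as-BYTES handling are all assumptions and need to be checked against the generated config.pbtxt.

import json
import numpy as np
import tritonclient.grpc as grpcclient

MODEL_NAME = "rtmpose"    # hypothetical model name, use your model repository entry
IMAGE_INPUT = "ori_img"   # hypothetical image input name, check config.pbtxt
BBOX_INPUT = "bbox"       # hypothetical bbox input name, check config.pbtxt

client = grpcclient.InferenceServerClient("localhost:8001")

image = np.zeros((480, 640, 3), dtype=np.uint8)  # placeholder H x W x 3 frame
bboxes = [[866, 47, 896, 101], [48, 65, 73, 125]]

# Serialize the PoseBbox structure as a JSON string and send it as a BYTES tensor
bbox_json = json.dumps({
    'type': 'PoseBbox',
    'value': [{'bbox': b} for b in bboxes]
})

inputs = [
    grpcclient.InferInput(IMAGE_INPUT, list(image.shape), "UINT8"),
    grpcclient.InferInput(BBOX_INPUT, [1], "BYTES"),
]
inputs[0].set_data_from_numpy(image)
inputs[1].set_data_from_numpy(np.array([bbox_json.encode()], dtype=np.object_))

response = client.infer(MODEL_NAME, inputs)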

Y-T-G · Oct 11 '23 07:10

Also, what does "only support batch dim 1 for single request" mean? Does this mean the Triton version does not support batch inference?

Y-T-G · Oct 14 '23 15:10

@Y-T-G

I created a dict for each bbox here and added it to the value list, and used that, but the results are not accurate, although the number of keypoints returned matches the number of bboxes.

Could you show the visualized result with the bboxes? Does the inference result with a single bbox look right?

Also, what does "only support batch dim 1 for single request" mean? Does this mean the Triton version does not support batch inference?

For batch inference with mmdeploy, you can refer to https://github.com/open-mmlab/mmdeploy/issues/839#issuecomment-1206029364

Triton server supports both the dynamic batcher and the sequence batcher, but the mmdeploy backend only supports the dynamic batcher. You can add these lines to config.pbtxt:

dynamic_batching {
  max_queue_delay_microseconds: 100
}
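
(As far as I understand, max_queue_delay_microseconds controls how long the dynamic batcher waits for additional requests before launching a batch, so larger values trade a little latency for bigger batches.)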

With allow_ragged_batch and dynamic_batching, the mmdeploy backend can receive a batch of requests at each inference step (therefore, you don't have to construct a normal batched input like b x c x h x w; you only need to send c x h x w to the Triton server and let it collect a batch of requests).

In summary, to use mmdeploy triton backend with batch inference, you have to:

  1. convert the model with batch inference support and edit the pipeline.json
  2. add dynamic_batching to config.pbtxt
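
To illustrate the flow above, here is a minimal client sketch. It is not part of the demo; the model name and input name are assumptions and should be checked against your config.pbtxt. Each request carries a single image with no batch dimension, and the server-side dynamic batcher groups concurrent requests.

import time
import numpy as np
import tritonclient.grpc as grpcclient

MODEL_NAME = "resnet18"   # hypothetical, use your deployed model name
IMAGE_INPUT = "ori_img"   # hypothetical, check the input name in config.pbtxt

client = grpcclient.InferenceServerClient("localhost:8001")
image = np.zeros((224, 224, 3), dtype=np.uint8)  # a single image, no batch dimension

responses = []
def on_done(result, error):
    # tritonclient calls this when each asynchronous request completes
    responses.append(error if error is not None else result)

inp = grpcclient.InferInput(IMAGE_INPUT, list(image.shape), "UINT8")
inp.set_data_from_numpy(image)

# Send several requests concurrently so the dynamic batcher can group them
# into a single execution on the server side
for _ in range(8):
    client.async_infer(MODEL_NAME, [inp], callback=on_done)

while len(responses) < 8:
    time.sleep(0.01)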

irexyc · Oct 16 '23 12:10

@irexyc

Could you show the visualized result with the bboxes? Does the inference result with a single bbox look right?

Yes, with a single bbox the inference is correct. But if I add more than one bbox, the outputs don't make any sense.

I only visualize the nose, left wrist and right wrist keypoints. This is from RTMPose.

This is how it looks when I add more than one: [screenshot]

This is how it looks when I do it individually, cropping each bbox and sending each one for inference separately: [screenshot]

The input for multiple bboxes looks like this:

{
   "type":"PoseBbox",
   "value":[
      {
         "bbox":[
            866,
            47,
            896,
            101
         ]
      },
      {
         "bbox":[
            48,
            65,
            73,
            125
         ]
      },
      {
         "bbox":[
            425,
            32,
            447,
            97
         ]
      },
      ...
   ]
}

Y-T-G · Oct 16 '23 13:10

For batch inference with mmdeploy, you can refer to #839 (comment)

Triton server supports both the dynamic batcher and the sequence batcher, but the mmdeploy backend only supports the dynamic batcher. You can add these lines to config.pbtxt:

dynamic_batching {
  max_queue_delay_microseconds: 100
}

With allow_ragged_batch and dynamic_batching, the mmdeploy backend can receive a batch of requests at each inference step (therefore, you don't have to construct a normal batched input like b x c x h x w; you only need to send c x h x w to the Triton server and let it collect a batch of requests).

In summary, to use mmdeploy triton backend with batch inference, you have to:

  1. convert the model with batch inference support and edit the pipeline.json
  2. add dynamic_batching to config.pbtxt

I am not sure this works. I don't see any improvement when I do this, checking with perf_analyzer on ResNet18:

Inferences/Second vs. Client Average Batch Latency
Concurrency: 2, throughput: 41.6086 infer/sec, latency 47983 usec
Concurrency: 3, throughput: 41.8305 infer/sec, latency 71626 usec
Concurrency: 4, throughput: 41.775 infer/sec, latency 95672 usec
Concurrency: 5, throughput: 41.2752 infer/sec, latency 120931 usec
Concurrency: 6, throughput: 41.7747 infer/sec, latency 143440 usec
Concurrency: 7, throughput: 41.7748 infer/sec, latency 167467 usec
Concurrency: 8, throughput: 41.6641 infer/sec, latency 191807 usec

Batching is already enabled in the pipeline.json.

I see a bigger improvement by launching multiple model instances using:

instance_group [ 
  { 
    count: 4
    kind: KIND_GPU 
  }
]
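
(As far as I understand, this runs four copies of the model on the GPU, so concurrent requests are executed in parallel by separate instances rather than being batched together.)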
Inferences/Second vs. Client Average Batch Latency
Concurrency: 2, throughput: 62.7163 infer/sec, latency 31838 usec
Concurrency: 3, throughput: 65.1612 infer/sec, latency 46015 usec
Concurrency: 4, throughput: 79.328 infer/sec, latency 50415 usec
Concurrency: 5, throughput: 84.3826 infer/sec, latency 59160 usec
Concurrency: 6, throughput: 90.2152 infer/sec, latency 66516 usec
Concurrency: 7, throughput: 89.4926 infer/sec, latency 78322 usec
Concurrency: 8, throughput: 88.104 infer/sec, latency 90731 usec

I think the dynamic batcher depends on sequence_batching. But since each request is handled separately in instance_state.cpp, dynamic_batching will not have any effect. To have an effect, the requests would have to be batched and then inferred all at once.

Y-T-G · Oct 16 '23 13:10