
Torchscript backend **MUCH** slower only with FP16

dmenig opened this issue 3 years ago

Description

I'm converting a PyTorch model to TorchScript with and without FP16 precision, and Triton inference is much slower with FP16, even though a standalone TorchScript benchmark shows the two variants to be about the same speed. Using the perf analyzer, I found that the step taking most of the time is "compute output", which seems very weird, doesn't it?

Triton Information

I'm using the 21.12 and 22.05 container versions, and both show the same problem.

To Reproduce

Here is the code I use to save the TorchScript model:

import torch
import torchvision

export_model = torch.jit.script(torchvision.models.resnet101().cuda().eval())
torch.jit.save(
    export_model,
    "model.pt",
)
# For the half-precision (FP16) version
export_model = torch.jit.script(torchvision.models.resnet101().cuda().eval().half())
torch.jit.save(
    export_model,
    "model.pt",
)

Here are the contents of my config.pbtxt (data_type is TYPE_FP32 or TYPE_FP16 depending on which variant I deploy):

name: "model"
platform: "pytorch_libtorch"
max_batch_size : 4
input [
  {
    name: "input__0"
    data_type: TYPE_FP32 / TYPE_FP16
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "output__0"
    data_type: TYPE_FP32 / TYPE_FP16
    dims: [ -1 ]
  }
]
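
For completeness, a minimal Python client sketch to send one request to the FP16 variant might look like this (assuming Triton's default HTTP port and the tritonclient package; for the FP32 variant, only the dtype and datatype string change):

import numpy as np
import tritonclient.http as httpclient

# Sanity-check client sketch; model and tensor names match the config above
client = httpclient.InferenceServerClient(url="localhost:8000")

batch = np.random.rand(1, 3, 224, 224).astype(np.float16)  # FP16 variant
inp = httpclient.InferInput("input__0", list(batch.shape), "FP16")
inp.set_data_from_numpy(batch)
out = httpclient.InferRequestedOutput("output__0")

result = client.infer(model_name="model", inputs=[inp], outputs=[out])
print(result.as_numpy("output__0").shape)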

Here is sample perf_analyzer output for the FP16 model:

root@be4c3d303700:/opt/tritonserver# perf_analyzer -m model --measurement-request-count 30
*** Measurement Settings ***
  Batch size: 1
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1
  Client: 
    Request count: 89
    Throughput: 17.8 infer/sec
    Avg latency: 56556 usec (standard deviation 154 usec)
    p50 latency: 56512 usec
    p90 latency: 56546 usec
    p95 latency: 56944 usec
    p99 latency: 57214 usec
    Avg HTTP time: 56541 usec (send/recv 77 usec + response wait 56464 usec)
  Server: 
    Inference count: 106
    Execution count: 106
    Successful request count: 106
    Avg request latency: 56080 usec (overhead 32 usec + queue 10207 usec + compute input 121 usec + compute infer 15852 usec + compute output 29868 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 17.8 infer/sec, latency 56556 usec

And here is the perf_analyzer output for the FP32 model:

root@be4c3d303700:/opt/tritonserver# perf_analyzer -m model --measurement-request-count 30
*** Measurement Settings ***
  Batch size: 1
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1
  Client: 
    Request count: 185
    Throughput: 37 infer/sec
    Avg latency: 26964 usec (standard deviation 4413 usec)
    p50 latency: 25359 usec
    p90 latency: 36882 usec
    p95 latency: 39795 usec
    p99 latency: 40857 usec
    Avg HTTP time: 27147 usec (send/recv 117 usec + response wait 27030 usec)
  Server: 
    Inference count: 221
    Execution count: 221
    Successful request count: 221
    Avg request latency: 26472 usec (overhead 33 usec + queue 10206 usec + compute input 199 usec + compute infer 14818 usec + compute output 1215 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 37 infer/sec, latency 26964 usec

As you can see, the step that takes far more time in the FP16 case is "compute output".

During the benchmark I see GPU utilization go to 100%, but power consumption stays below 50% of the maximum power output!

Is that expected? It seems very weird to me!

(In fact, I'm using another custom model where the problem is even worse.)

dmenig avatar May 31 '22 16:05 dmenig

Interesting. The Torch backend itself reads the outputs from the framework in the same way regardless of the output data type, so I wonder if you observe the same performance impact when running both models natively in PyTorch. You can also use Nsight Systems to profile where the time is spent in detail.

GuanLuo avatar Jun 01 '22 23:06 GuanLuo
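
For reference, a minimal native comparison along these lines might look like the following sketch (batch size 1 matches the perf_analyzer runs above; the warm-up and iteration counts are arbitrary assumptions):

import time
import torch
import torchvision

def bench(model, dtype, iters=100):
    # Time single-image forward passes with explicit CUDA synchronization
    x = torch.randn(1, 3, 224, 224, device="cuda", dtype=dtype)
    with torch.no_grad():
        for _ in range(10):  # warm-up
            model(x)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1e3  # ms per inference

fp32 = torch.jit.script(torchvision.models.resnet101().cuda().eval())
fp16 = torch.jit.script(torchvision.models.resnet101().cuda().eval().half())
print(f"FP32: {bench(fp32, torch.float32):.2f} ms")
print(f"FP16: {bench(fp16, torch.float16):.2f} ms")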

In Python, the TorchScript FP16 model is about the same speed as the FP32 one. The same is true for plain PyTorch.

dmenig avatar Jun 03 '22 13:06 dmenig

Closing this issue due to lack of activity. Please re-open it if you would like to follow up.

jbkyang-nvi avatar Nov 22 '22 03:11 jbkyang-nvi

Hi. The issue is still not resolved, so I think it should stay open.

dmenig avatar Nov 28 '22 09:11 dmenig

I guess I'll create a new one then.

dmenig avatar Nov 28 '22 10:11 dmenig

I've filed a ticket for us to investigate. Thank you kindly for your patience, Damien.

the-david-oy avatar Nov 28 '22 19:11 the-david-oy

We've identified a potential cause: CPU overhead at small batch sizes causes the FP16 model to be slower than the FP32 one. More on this can be found here: https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs/ We will continue working on it.

oandreeva-nv avatar Feb 24 '23 21:02 oandreeva-nv
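
For background, the CUDA Graphs technique described in the linked post reduces per-kernel CPU launch overhead by capturing an entire forward pass once and replaying it. A minimal standalone PyTorch sketch for the FP16 ResNet-101 used in this issue (not the Triton backend itself; the warm-up count is arbitrary) could look like:

import torch
import torchvision

model = torchvision.models.resnet101().cuda().eval().half()
static_input = torch.randn(1, 3, 224, 224, device="cuda", dtype=torch.half)

# Warm up on a side stream before capture, as recommended by the PyTorch docs
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture one forward pass into a graph; later replays skip most of the CPU work
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_output = model(static_input)

# To run new data: copy it into the static input buffer, then replay the graph
static_input.copy_(torch.randn_like(static_input))
graph.replay()
print(static_output.float().mean())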