Big performance difference on TensorRT
Hi, I just tried the demo code below. In your results, the [TensorRT (FP16)] numbers are much better than the others, but the results I got are quite different: there is not such a big difference between [TensorRT (FP16)] and the others (the output is attached). Do you know what happened, or how I can figure out the reason for that? Thank you.
docker run -it --rm --gpus all \
-v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.4.0 \
bash -c "cd /project && \
convert_model -m \"philschmid/MiniLM-L6-H384-uncased-sst2\" \
--backend tensorrt onnx \
--seq-len 16 128 128"
Inference done on Tesla M60
latencies:
[Pytorch (FP32)] mean=6.31ms, sd=1.32ms, min=4.48ms, max=10.75ms, median=6.39ms, 95p=8.63ms, 99p=9.33ms
[Pytorch (FP16)] mean=8.81ms, sd=2.02ms, min=6.59ms, max=55.42ms, median=8.70ms, 95p=11.20ms, 99p=12.16ms
**[TensorRT (FP16)] mean=4.59ms, sd=1.97ms, min=2.27ms, max=10.38ms, median=4.47ms, 95p=8.02ms, 99p=8.86ms**
[ONNX Runtime (FP32)] mean=5.03ms, sd=2.00ms, min=2.64ms, max=10.45ms, median=5.16ms, 95p=8.37ms, 99p=9.17ms
[ONNX Runtime (optimized)] mean=5.19ms, sd=2.04ms, min=2.80ms, max=10.59ms, median=5.25ms, 95p=8.67ms, 99p=9.40ms
Also, I got this error while running the command above:
Traceback (most recent call last):
File "/usr/local/bin/convert_model", line 8, in <module>
sys.exit(entrypoint())
File "/usr/local/lib/python3.8/dist-packages/transformer_deploy/convert.py", line 358, in entrypoint
main(commands=args)
File "/usr/local/lib/python3.8/dist-packages/transformer_deploy/convert.py", line 329, in main
check_accuracy(
File "/usr/local/lib/python3.8/dist-packages/transformer_deploy/convert.py", line 82, in check_accuracy
f"{engine_name} discrepency is too high ({discrepency:.2f} > {tolerance}):\n"
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 678, in __array__
return self.numpy()
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
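For reference, this TypeError is the standard PyTorch complaint about converting a CUDA tensor straight to NumPy; the generic pattern (a minimal sketch, unrelated to the convert.py code itself) is to copy the tensor to host memory first:

```python
import torch

t = torch.ones(3, device="cuda")
# t.numpy() would raise the TypeError above because the data lives on the GPU;
# copying the tensor to host memory first makes the conversion valid.
arr = t.cpu().numpy()
print(arr)
```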
Hi @HireezShanPeng,
The big performance difference comes from the RTX 3090 card used in the README.
The M60 doesn't support fast 16-bit precision, so there is no performance advantage to running a 16-bit model on it.
Logs on my 1080 Ti, which also doesn't support fast 16-bit precision:

More information:
- https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/
- https://www.tensorflow.org/guide/mixed_precision#supported_hardware
- https://developer.nvidia.com/blog/mixed-precision-programming-cuda-8/
- https://github.com/NVIDIA/TensorRT/issues/218
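A quick way to check whether a given card benefits from FP16 at all is to time an FP32 vs FP16 matmul in plain PyTorch. This is a minimal sketch, independent of transformer-deploy:

```python
import time
import torch

def bench(dtype, size=4096, iters=50):
    # Time a square matmul at the given precision on the current GPU.
    a = torch.randn(size, size, device="cuda", dtype=dtype)
    b = torch.randn(size, size, device="cuda", dtype=dtype)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        _ = a @ b
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1000  # ms per matmul

major, minor = torch.cuda.get_device_capability()
print(f"{torch.cuda.get_device_name()} (compute capability {major}.{minor})")
print(f"FP32: {bench(torch.float32):.2f} ms, FP16: {bench(torch.float16):.2f} ms")
```

On an M60 or 1080 Ti the FP16 timing should not be meaningfully better than FP32 (it can even be worse), while on a T4 or RTX 3090 the FP16 matmul should be clearly faster.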
Thank you @kamalkraj for your answer. Just to complete the picture:
- when there are no tensor cores dedicated to FP16, mixed precision is usually slower because we add casts (FP32 <-> FP16) here and there, involving plenty of tensor copies; this is mostly offset by the kernel fusions applied (see the sketch after this list);
- if you are doing cloud inference, the T4 is a "cheap" option (relative to other GPU prices) which supports FP16. Keep in mind, however, that recent GPUs have many more FP16 tensor cores than the good old T4, so the gap with FP32 is even larger.
Regarding your bug, I am not sure I understand when it happens. Can you please provide more context?
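To make the casting point concrete, here is a minimal mixed-precision inference sketch in plain PyTorch, using the model from the issue above (an illustration, not how transformer-deploy runs inference): under autocast, matmul-heavy ops run in FP16 while other ops stay in FP32, so casts and the associated copies get inserted in between, and without FP16 tensor cores those casts are pure overhead.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "philschmid/MiniLM-L6-H384-uncased-sst2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).cuda().eval()

inputs = tokenizer("a very good movie", return_tensors="pt").to("cuda")
with torch.no_grad(), torch.cuda.amp.autocast():
    # autocast runs matmul-heavy ops in FP16 and keeps the rest in FP32,
    # inserting FP32 <-> FP16 casts (and tensor copies) between them.
    logits = model(**inputs).logits
print(logits)
```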