Big performance difference on TensorRT
Hi, I just tried the demo code below. In your results, the [TensorRT (FP16)] numbers are much better than the others, but the results I got are quite different: there is not such a big difference between [TensorRT (FP16)] and the others (the output is attached). Do you know what happened, or how I can figure out the reason for that? Thank you.
docker run -it --rm --gpus all \
-v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.4.0 \
bash -c "cd /project && \
convert_model -m \"philschmid/MiniLM-L6-H384-uncased-sst2\" \
--backend tensorrt onnx \
--seq-len 16 128 128"
Inference done on Tesla M60
latencies:
[Pytorch (FP32)] mean=6.31ms, sd=1.32ms, min=4.48ms, max=10.75ms, median=6.39ms, 95p=8.63ms, 99p=9.33ms
[Pytorch (FP16)] mean=8.81ms, sd=2.02ms, min=6.59ms, max=55.42ms, median=8.70ms, 95p=11.20ms, 99p=12.16ms
**[TensorRT (FP16)] mean=4.59ms, sd=1.97ms, min=2.27ms, max=10.38ms, median=4.47ms, 95p=8.02ms, 99p=8.86ms**
[ONNX Runtime (FP32)] mean=5.03ms, sd=2.00ms, min=2.64ms, max=10.45ms, median=5.16ms, 95p=8.37ms, 99p=9.17ms
[ONNX Runtime (optimized)] mean=5.19ms, sd=2.04ms, min=2.80ms, max=10.59ms, median=5.25ms, 95p=8.67ms, 99p=9.40ms
Also, I got this error while running the command above:
Traceback (most recent call last):
File "/usr/local/bin/convert_model", line 8, in <module>
sys.exit(entrypoint())
File "/usr/local/lib/python3.8/dist-packages/transformer_deploy/convert.py", line 358, in entrypoint
main(commands=args)
File "/usr/local/lib/python3.8/dist-packages/transformer_deploy/convert.py", line 329, in main
check_accuracy(
File "/usr/local/lib/python3.8/dist-packages/transformer_deploy/convert.py", line 82, in check_accuracy
f"{engine_name} discrepency is too high ({discrepency:.2f} > {tolerance}):\n"
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 678, in __array__
return self.numpy()
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
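For reference, this TypeError is the standard PyTorch complaint about converting a CUDA tensor straight to NumPy; the generic pattern (a minimal sketch, unrelated to the convert.py code itself) is to copy the tensor to host memory first:

```python
import torch

t = torch.ones(3, device="cuda")
# t.numpy() would raise the TypeError above because the data lives on the GPU;
# copying the tensor to host memory first makes the conversion valid.
arr = t.cpu().numpy()
print(arr)
```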
Hi @HireezShanPeng,
The big performance difference comes from the RTX 3090 card used in the README.
The M60 doesn't support fast 16-bit precision, so there is no performance advantage to running a 16-bit model on it.
Logs on my 1080 Ti, which also doesn't support fast 16-bit precision:

More information:
- https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/
- https://www.tensorflow.org/guide/mixed_precision#supported_hardware
- https://developer.nvidia.com/blog/mixed-precision-programming-cuda-8/
- https://github.com/NVIDIA/TensorRT/issues/218
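A quick way to check whether a given card benefits from FP16 at all is to time an FP32 vs FP16 matmul in plain PyTorch. This is a minimal sketch, independent of transformer-deploy:

```python
import time
import torch

def bench(dtype, size=4096, iters=50):
    # Time a square matmul at the given precision on the current GPU.
    a = torch.randn(size, size, device="cuda", dtype=dtype)
    b = torch.randn(size, size, device="cuda", dtype=dtype)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        _ = a @ b
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1000  # ms per matmul

major, minor = torch.cuda.get_device_capability()
print(f"{torch.cuda.get_device_name()} (compute capability {major}.{minor})")
print(f"FP32: {bench(torch.float32):.2f} ms, FP16: {bench(torch.float16):.2f} ms")
```

On an M60 or 1080 Ti the FP16 timing should not be meaningfully better than FP32 (it can even be worse), while on a T4 or RTX 3090 the FP16 matmul should be clearly faster.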
Thank you @kamalkraj for your answer. Just to complete the picture:
- when there are no tensor cores dedicated to FP16, mixed precision is usually slower because we add casts (FP32 <-> FP16) here and there, involving plenty of tensor copies; this is mostly offset by the kernel fusions applied (see the sketch after this list);
- if you are doing cloud inference, the T4 is a "cheap" option (relative to other GPU prices) which supports FP16. Keep in mind, however, that recent GPUs have many more FP16 tensor cores than the good old T4, so the gap with FP32 is even larger.
Regarding your bug, I am not sure I understand when it happens. Can you please provide more context?
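To make the casting point concrete, here is a minimal mixed-precision inference sketch in plain PyTorch, using the model from the issue above (an illustration, not how transformer-deploy runs inference): under autocast, matmul-heavy ops run in FP16 while other ops stay in FP32, so casts and the associated copies get inserted in between, and without FP16 tensor cores those casts are pure overhead.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "philschmid/MiniLM-L6-H384-uncased-sst2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).cuda().eval()

inputs = tokenizer("a very good movie", return_tensors="pt").to("cuda")
with torch.no_grad(), torch.cuda.amp.autocast():
    # autocast runs matmul-heavy ops in FP16 and keeps the rest in FP32,
    # inserting FP32 <-> FP16 casts (and tensor copies) between them.
    logits = model(**inputs).logits
print(logits)
```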