Inference slower on A40 than A30
Description
I am moving from an A30 to an A40, so I needed to rebuild my ONNX model for the A40.
I rebuilt the engine with the same trtexec version, the same command, and the same model, using the same Docker image as on the A30.
The image: nvcr.io/nvidia/tensorrt:24.06-py3
The command:
trtexec --onnx=model.onnx \
--maxShapes=input:4x3x1024x1024 \
--minShapes=input:1x3x1024x1024 \
--optShapes=input:2x3x1024x1024 \
--fp16 \
--saveEngine=model.plan
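Before involving Triton, it can help to benchmark the raw engine with trtexec itself; a minimal sketch, where the --shapes value (batch 2) is an assumption, so pick whichever batch matters to you:

# Time the engine directly on the GPU, bypassing Triton entirely.
trtexec --loadEngine=model.plan \
--shapes=input:2x3x1024x1024

Comparing these numbers between the two GPUs isolates engine performance from client/server overhead.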
I benchmarked my models on both GPUs using Triton Inference Server 2.47.0 and got the following results.
A30:
Inferences/Second vs. Client p95 Batch Latency
Concurrency: 1, throughput: 15.4337 infer/sec, latency 74841 usec
Concurrency: 2, throughput: 30.3723 infer/sec, latency 77563 usec
Concurrency: 3, throughput: 35.0317 infer/sec, latency 94443 usec
Concurrency: 4, throughput: 37.0215 infer/sec, latency 132680 usec
A40:
Inferences/Second vs. Client p95 Batch Latency
Concurrency: 1, throughput: 12.2598 infer/sec, latency 88722 usec
Concurrency: 2, throughput: 24.3012 infer/sec, latency 82894 usec
Concurrency: 3, throughput: 28.9309 infer/sec, latency 104551 usec
Concurrency: 4, throughput: 30.2839 infer/sec, latency 160710 usec
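For context, a perf_analyzer invocation along these lines should produce this "Inferences/Second vs. Client p95 Batch Latency" table; the model name and --shape value here are assumptions, not taken from my actual setup:

# Hypothetical invocation; adjust -m and --shape to your deployment.
perf_analyzer -m model \
--percentile=95 \
--concurrency-range 1:4 \
--shape input:3,1024,1024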
Environment
TensorRT Version: 10.1.0.27
NVIDIA GPU: A40
NVIDIA Driver Version: 555.58.02
CUDA Version: 12.1
Operating System: Ubuntu 22.04
Try adding --builderOptimizationLevel=5.
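A sketch of the rebuild with the flag added, keeping the same shapes and options as the original command:

trtexec --onnx=model.onnx \
--minShapes=input:1x3x1024x1024 \
--optShapes=input:2x3x1024x1024 \
--maxShapes=input:4x3x1024x1024 \
--fp16 \
--builderOptimizationLevel=5 \
--saveEngine=model.plan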
@lix19937 I added this flag but got:
[08/01/2024-08:23:19] [W] [TRT] UNSUPPORTED_STATE: Skipping tactic 0 due to insufficient memory on requested size of 68727865856 detected for tactic 0x0000000000000018.
[08/01/2024-08:23:20] [W] [TRT] Tactic Device request: 65544MB Available: 45525MB. Device memory is insufficient to use tactic.
[08/01/2024-08:23:20] [W] [TRT] UNSUPPORTED_STATE: Skipping tactic 1 due to insufficient memory on requested size of 68727865856 detected for tactic 0x0000000000000019.
[08/01/2024-08:23:20] [W] [TRT] Tactic Device request: 65544MB Available: 45525MB. Device memory is insufficient to use tactic.
[08/01/2024-08:23:20] [W] [TRT] UNSUPPORTED_STATE: Skipping tactic 2 due to insufficient memory on requested size of 68727865856 detected for tactic 0x000000000000001a.
[08/01/2024-08:23:20] [W] [TRT] Tactic Device request: 65544MB Available: 45525MB. Device memory is insufficient to use tactic.
[08/01/2024-08:23:20] [W] [TRT] UNSUPPORTED_STATE: Skipping tactic 3 due to insufficient memory on requested size of 68727865856 detected for tactic 0x000000000000001b.
[08/01/2024-08:23:20] [W] [TRT] Tactic Device request: 65544MB Available: 45525MB. Device memory is insufficient to use tactic.
[08/01/2024-08:23:20] [W] [TRT] UNSUPPORTED_STATE: Skipping tactic 4 due to insufficient memory on requested size of 68727865856 detected for tactic 0x000000000000001f.
Why is 45 GB of VRAM insufficient?
@lix19937
Despite the GPU-memory warnings, I rebuilt the model with --builderOptimizationLevel=5, but got quite similar results:
Inferences/Second vs. Client p95 Batch Latency
Concurrency: 1, throughput: 12.5453 infer/sec, latency 81882 usec
Concurrency: 2, throughput: 24.5096 infer/sec, latency 95886 usec
Concurrency: 3, throughput: 28.1778 infer/sec, latency 109527 usec
Concurrency: 4, throughput: 29.4522 infer/sec, latency 168369 usec
So I think either the flag had no effect at all, or the skipped tactics affected the result.
Why is 45 GB of VRAM insufficient?
Yes. The skipped tactics request 65544 MB of device memory but only 45525 MB is available, so TensorRT skips them; these are warnings, not errors. You can try increasing the workspace size.
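A sketch of the rebuild with an explicit workspace pool; the 40960M value is an assumption, sized just below the A40's 48 GB:

trtexec --onnx=model.onnx \
--minShapes=input:1x3x1024x1024 \
--optShapes=input:2x3x1024x1024 \
--maxShapes=input:4x3x1024x1024 \
--fp16 \
--builderOptimizationLevel=5 \
--memPoolSize=workspace:40960M \
--saveEngine=model.plan

Note that the skipped tactics request roughly 64 GB (68727865856 bytes), which is more than the A40's total memory, so those particular tactics will likely be skipped regardless of the workspace setting.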
BTW, the A30 and A40 are different hardware with different computational power. When comparing them, keep the GPU frequency stable, compare the power limits, and use the Nsight Systems tool to profile resource utilization.
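A sketch of both steps; the clock values below are assumptions (query the supported clocks first), and the tritonserver flags are illustrative:

# List the clocks this GPU supports, then lock graphics clocks for a stable run.
nvidia-smi -q -d SUPPORTED_CLOCKS
sudo nvidia-smi -lgc 1740,1740

# Profile the server under load with Nsight Systems.
nsys profile -o triton_a40 --trace=cuda,nvtx \
tritonserver --model-repository=/models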
@decadance-dance can you retry with nvcr.io/nvidia/tensorrt:25.01-py3 to see if this is still an issue?
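A sketch of the retry, assuming the model sits in the current directory:

docker run --gpus all -it --rm -v "$(pwd)":/workspace \
nvcr.io/nvidia/tensorrt:25.01-py3
# then rerun the same trtexec command from above inside the container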
@decadance-dance Why would you expect A40 to be faster than A30?
Based on the specs:
- A30 dense FP16 FLOPS: 165 TFLOPS
- A40 dense FP16 FLOPS: 149.7 TFLOPS
The A30 is supposed to be about 10% faster than the A40 (165 / 149.7 ≈ 1.10).
I am going to close this for now since the A40 is expected to be slower than the A30. Please reopen if you still have other questions. Thanks.