
Inference slower on A40 than A30

decadance-dance opened this issue 1 year ago • 5 comments

Description

I am moving from an A30 to an A40, so I needed to rebuild my ONNX model for the A40. I rebuilt it using the same trtexec version, the same command, and the same model, via the same Docker image as on the A30. The image: nvcr.io/nvidia/tensorrt:24.06-py3. The command:

trtexec --onnx=model.onnx \
        --maxShapes=input:4x3x1024x1024 \
        --minShapes=input:1x3x1024x1024 \
        --optShapes=input:2x3x1024x1024 \
        --fp16 \
        --saveEngine=model.plan
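
As a sanity check outside Triton, the same engine can also be timed with trtexec directly; a sketch, using the opt-profile shape from the build command above:

trtexec --loadEngine=model.plan \
        --shapes=input:2x3x1024x1024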

I benchmark my models on both GPUs using Triton Inference Server 2.47.0 and get:

A30:

Inferences/Second vs. Client p95 Batch Latency
Concurrency: 1, throughput: 15.4337 infer/sec, latency 74841 usec
Concurrency: 2, throughput: 30.3723 infer/sec, latency 77563 usec
Concurrency: 3, throughput: 35.0317 infer/sec, latency 94443 usec
Concurrency: 4, throughput: 37.0215 infer/sec, latency 132680 usec

A40:

Inferences/Second vs. Client p95 Batch Latency
Concurrency: 1, throughput: 12.2598 infer/sec, latency 88722 usec
Concurrency: 2, throughput: 24.3012 infer/sec, latency 82894 usec
Concurrency: 3, throughput: 28.9309 infer/sec, latency 104551 usec
Concurrency: 4, throughput: 30.2839 infer/sec, latency 160710 usec
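
The output format above matches Triton's perf_analyzer; an invocation along these lines would produce it (model name and gRPC endpoint are assumptions here):

perf_analyzer -m model -u localhost:8001 -i grpc \
        --concurrency-range 1:4 --percentile=95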

Environment

TensorRT Version: 10.1.0.27

NVIDIA GPU: A40

NVIDIA Driver Version: 555.58.02

CUDA Version: 12.1

Operating System: Ubuntu 22.04

decadance-dance commented on Jul 31 '24

Try adding --builderOptimizationLevel=5.
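
Concretely, that means re-running the original build command with the extra flag appended; a sketch based on the command above:

trtexec --onnx=model.onnx \
        --minShapes=input:1x3x1024x1024 \
        --optShapes=input:2x3x1024x1024 \
        --maxShapes=input:4x3x1024x1024 \
        --builderOptimizationLevel=5 \
        --fp16 \
        --saveEngine=model.plan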

lix19937 commented on Aug 01 '24

@lix19937 I added this flag but got:

[08/01/2024-08:23:19] [W] [TRT] UNSUPPORTED_STATE: Skipping tactic 0 due to insufficient memory on requested size of 68727865856 detected for tactic 0x0000000000000018.
[08/01/2024-08:23:20] [W] [TRT] Tactic Device request: 65544MB Available: 45525MB. Device memory is insufficient to use tactic.
[08/01/2024-08:23:20] [W] [TRT] UNSUPPORTED_STATE: Skipping tactic 1 due to insufficient memory on requested size of 68727865856 detected for tactic 0x0000000000000019.
[08/01/2024-08:23:20] [W] [TRT] Tactic Device request: 65544MB Available: 45525MB. Device memory is insufficient to use tactic.
[08/01/2024-08:23:20] [W] [TRT] UNSUPPORTED_STATE: Skipping tactic 2 due to insufficient memory on requested size of 68727865856 detected for tactic 0x000000000000001a.
[08/01/2024-08:23:20] [W] [TRT] Tactic Device request: 65544MB Available: 45525MB. Device memory is insufficient to use tactic.
[08/01/2024-08:23:20] [W] [TRT] UNSUPPORTED_STATE: Skipping tactic 3 due to insufficient memory on requested size of 68727865856 detected for tactic 0x000000000000001b.
[08/01/2024-08:23:20] [W] [TRT] Tactic Device request: 65544MB Available: 45525MB. Device memory is insufficient to use tactic.
[08/01/2024-08:23:20] [W] [TRT] UNSUPPORTED_STATE: Skipping tactic 4 due to insufficient memory on requested size of 68727865856 detected for tactic 0x000000000000001f.

Why is 45 GB of VRAM insufficient?

decadance-dance commented on Aug 01 '24

@lix19937 Despite the GPU-memory warnings, I rebuilt the model with --builderOptimizationLevel=5, but got very similar results:

Inferences/Second vs. Client p95 Batch Latency
Concurrency: 1, throughput: 12.5453 infer/sec, latency 81882 usec
Concurrency: 2, throughput: 24.5096 infer/sec, latency 95886 usec
Concurrency: 3, throughput: 28.1778 infer/sec, latency 109527 usec
Concurrency: 4, throughput: 29.4522 infer/sec, latency 168369 usec

So I think either the flag had no effect at all, or the skipped tactics limited the result.

decadance-dance commented on Aug 01 '24

Why is 45 GB of VRAM insufficient?

You can try increasing the workspace size.
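
In trtexec the workspace is controlled with the --memPoolSize flag; a sketch, with the 40 GiB value picked arbitrarily for illustration. Note that the skipped tactics were requesting ~64 GiB of scratch, more than the A40's 48 GB total, so those particular warnings are expected; the builder simply falls back to other tactics:

trtexec --onnx=model.onnx \
        --minShapes=input:1x3x1024x1024 \
        --optShapes=input:2x3x1024x1024 \
        --maxShapes=input:4x3x1024x1024 \
        --memPoolSize=workspace:40960M \
        --builderOptimizationLevel=5 \
        --fp16 \
        --saveEngine=model.plan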

lix19937 commented on Aug 01 '24

BTW, the A30 and A40 are different hardware with different computational power. You should keep the clock frequency stable, compare the power limits, and use the Nsight Systems tools to profile resource utilization.
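
A sketch of those checks (the 1740 MHz value is a placeholder; query nvidia-smi -q -d SUPPORTED_CLOCKS for valid values, and the engine path follows the build command above):

sudo nvidia-smi -lgc 1740,1740        # lock GPU clocks for reproducible runs
nvidia-smi -q -d POWER                # compare power limits between the cards
nsys profile -o a40_report trtexec --loadEngine=model.plan \
        --shapes=input:2x3x1024x1024  # profile an inference run
sudo nvidia-smi -rgc                  # release the clock lock afterwards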

lix19937 commented on Aug 01 '24

@decadance-dance can you retry with nvcr.io/nvidia/tensorrt:25.01-py3 to see if this is still an issue?

brnguyen2 commented on Feb 11 '25

@decadance-dance Why would you expect A40 to be faster than A30?

Based on the specs, A30 is supposed to be about 10% faster than A40 for this kind of workload: per the datasheets, A30 offers 933 GB/s of HBM2 memory bandwidth and roughly 165 TFLOPS of dense FP16 Tensor Core throughput, versus 696 GB/s of GDDR6 and roughly 149.7 TFLOPS on A40.

nvpohanh commented on Mar 04 '25

I am going to close this for now since A40 is expected to be slower than A30. Please reopen if you still have other questions. Thanks

nvpohanh commented on Mar 06 '25