TensorRT
TensorRT FP32 inference slower than PyTorch on Tesla T4 for GroundingDINO
Description
I converted GroundingDINO from PyTorch to TensorRT on an A100, which accelerated inference by about 50%. However, when I deploy the same model on a T4 and rebuild the engine there, TensorRT FP32 inference is slower than PyTorch.
Environment
TensorRT Version: 8.6.1.6
NVIDIA GPU: Tesla T4
NVIDIA Driver Version: 535.129.03
CUDA Version: 11.7
CUDNN Version: 8.6
Operating System:
Python Version (if applicable):
Tensorflow Version (if applicable):
PyTorch Version (if applicable): 1.8
Baremetal or Container (if so, version):
Relevant Files
Inference time for PyTorch is around 650 ms.
The log from building the engine is as follows:
[01/18/2024-06:58:22] [I] === Model Options ===
[01/18/2024-06:58:22] [I] Format: ONNX
[01/18/2024-06:58:22] [I] Model: /workspace/groundingdino.onnx
[01/18/2024-06:58:22] [I] Output:
[01/18/2024-06:58:22] [I] === Build Options ===
[01/18/2024-06:58:22] [I] Max batch: explicit batch
[01/18/2024-06:58:22] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[01/18/2024-06:58:22] [I] minTiming: 1
[01/18/2024-06:58:22] [I] avgTiming: 8
[01/18/2024-06:58:22] [I] Precision: FP32
[01/18/2024-06:58:22] [I] LayerPrecisions:
[01/18/2024-06:58:22] [I] Layer Device Types:
[01/18/2024-06:58:22] [I] Calibration:
[01/18/2024-06:58:22] [I] Refit: Disabled
[01/18/2024-06:58:22] [I] Version Compatible: Disabled
[01/18/2024-06:58:22] [I] TensorRT runtime: full
[01/18/2024-06:58:22] [I] Lean DLL Path:
[01/18/2024-06:58:22] [I] Tempfile Controls: { in_memory: allow, temporary: allow }
[01/18/2024-06:58:22] [I] Exclude Lean Runtime: Disabled
[01/18/2024-06:58:22] [I] Sparsity: Disabled
[01/18/2024-06:58:22] [I] Safe mode: Disabled
[01/18/2024-06:58:22] [I] Build DLA standalone loadable: Disabled
[01/18/2024-06:58:22] [I] Allow GPU fallback for DLA: Disabled
[01/18/2024-06:58:22] [I] DirectIO mode: Disabled
[01/18/2024-06:58:22] [I] Restricted mode: Disabled
[01/18/2024-06:58:22] [I] Skip inference: Disabled
[01/18/2024-06:58:22] [I] Save engine: /workspace/groundingdino.trt
[01/18/2024-06:58:22] [I] Load engine:
[01/18/2024-06:58:22] [I] Profiling verbosity: 0
[01/18/2024-06:58:22] [I] Tactic sources: Using default tactic sources
[01/18/2024-06:58:22] [I] timingCacheMode: local
[01/18/2024-06:58:22] [I] timingCacheFile:
[01/18/2024-06:58:22] [I] Heuristic: Disabled
[01/18/2024-06:58:22] [I] Preview Features: Use default preview flags.
[01/18/2024-06:58:22] [I] MaxAuxStreams: -1
[01/18/2024-06:58:22] [I] BuilderOptimizationLevel: -1
[01/18/2024-06:58:22] [I] Input(s)s format: fp32:CHW
[01/18/2024-06:58:22] [I] Output(s)s format: fp32:CHW
[01/18/2024-06:58:22] [I] Input build shape: bert_output=1x1x768+1x6x768+1x256x768
[01/18/2024-06:58:22] [I] Input build shape: img=1x3x800x1200+1x3x800x1200+1x3x800x1200
[01/18/2024-06:58:22] [I] Input build shape: attention_mask=1x1+1x6+1x256
[01/18/2024-06:58:22] [I] Input build shape: position_ids=1x1+1x6+1x256
[01/18/2024-06:58:22] [I] Input build shape: object_mask=1x256+1x256+1x256
[01/18/2024-06:58:22] [I] Input build shape: text_token_mask=1x1x1+1x6x6+1x256x256
[01/18/2024-06:58:22] [I] Input calibration shapes: model
[01/18/2024-06:58:22] [I] === System Options ===
[01/18/2024-06:58:22] [I] Device: 0
[01/18/2024-06:58:22] [I] DLACore:
[01/18/2024-06:58:22] [I] Plugins:
[01/18/2024-06:58:22] [I] setPluginsToSerialize:
[01/18/2024-06:58:22] [I] dynamicPlugins:
[01/18/2024-06:58:22] [I] ignoreParsedPluginLibs: 0
[01/18/2024-06:58:22] [I]
[01/18/2024-06:58:22] [I] === Inference Options ===
[01/18/2024-06:58:22] [I] Batch: Explicit
[01/18/2024-06:58:22] [I] Input inference shape: text_token_mask=1x6x6
[01/18/2024-06:58:22] [I] Input inference shape: object_mask=1x256
[01/18/2024-06:58:22] [I] Input inference shape: position_ids=1x6
[01/18/2024-06:58:22] [I] Input inference shape: attention_mask=1x6
[01/18/2024-06:58:22] [I] Input inference shape: bert_output=1x6x768
[01/18/2024-06:58:22] [I] Input inference shape: img=1x3x800x1200
[01/18/2024-06:58:22] [I] Iterations: 10
[01/18/2024-06:58:22] [I] Duration: 3s (+ 200ms warm up)
[01/18/2024-06:58:22] [I] Sleep time: 0ms
[01/18/2024-06:58:22] [I] Idle time: 0ms
[01/18/2024-06:58:22] [I] Inference Streams: 1
[01/18/2024-06:58:22] [I] ExposeDMA: Disabled
[01/18/2024-06:58:22] [I] Data transfers: Enabled
[01/18/2024-06:58:22] [I] Spin-wait: Disabled
[01/18/2024-06:58:22] [I] Multithreading: Disabled
[01/18/2024-06:58:22] [I] CUDA Graph: Disabled
[01/18/2024-06:58:22] [I] Separate profiling: Disabled
[01/18/2024-06:58:22] [I] Time Deserialize: Disabled
[01/18/2024-06:58:22] [I] Time Refit: Disabled
[01/18/2024-06:58:22] [I] NVTX verbosity: 0
[01/18/2024-06:58:22] [I] Persistent Cache Ratio: 0
[01/18/2024-06:58:22] [I] Inputs:
[01/18/2024-06:58:22] [I] === Reporting Options ===
[01/18/2024-06:58:22] [I] Verbose: Disabled
[01/18/2024-06:58:22] [I] Averages: 10 inferences
[01/18/2024-06:58:22] [I] Percentiles: 90,95,99
[01/18/2024-06:58:22] [I] Dump refittable layers:Disabled
[01/18/2024-06:58:22] [I] Dump output: Disabled
[01/18/2024-06:58:22] [I] Profile: Disabled
[01/18/2024-06:58:22] [I] Export timing to JSON file:
[01/18/2024-06:58:22] [I] Export output to JSON file:
[01/18/2024-06:58:22] [I] Export profile to JSON file:
[01/18/2024-06:58:22] [I]
[01/18/2024-06:58:24] [I] === Device Information ===
[01/18/2024-06:58:24] [I] Selected Device: Tesla T4
[01/18/2024-06:58:24] [I] Compute Capability: 7.5
[01/18/2024-06:58:24] [I] SMs: 40
[01/18/2024-06:58:24] [I] Device Global Memory: 14930 MiB
[01/18/2024-06:58:24] [I] Shared Memory per SM: 64 KiB
[01/18/2024-06:58:24] [I] Memory Bus Width: 256 bits (ECC enabled)
[01/18/2024-06:58:24] [I] Application Compute Clock Rate: 1.59 GHz
[01/18/2024-06:58:24] [I] Application Memory Clock Rate: 5.001 GHz
[01/18/2024-06:58:24] [I]
[01/18/2024-06:58:24] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.
[01/18/2024-06:58:24] [I]
[01/18/2024-06:58:24] [I] TensorRT version: 8.6.1
[01/18/2024-06:58:24] [I] Loading standard plugins
[01/18/2024-06:58:25] [I] [TRT] [MemUsageChange] Init CUDA: CPU +14, GPU +0, now: CPU 19, GPU 105 (MiB)
[01/18/2024-06:58:32] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +896, GPU +174, now: CPU 991, GPU 279 (MiB)
[01/18/2024-06:58:32] [I] Start parsing network model.
[01/18/2024-06:58:32] [I] [TRT] ----------------------------------------------------------------
[01/18/2024-06:58:32] [I] [TRT] Input filename: /workspace/groundingdino.onnx
[01/18/2024-06:58:32] [I] [TRT] ONNX IR version: 0.0.8
[01/18/2024-06:58:32] [I] [TRT] Opset version: 16
[01/18/2024-06:58:32] [I] [TRT] Producer name: pytorch
[01/18/2024-06:58:32] [I] [TRT] Producer version: 1.13.1
[01/18/2024-06:58:32] [I] [TRT] Domain:
[01/18/2024-06:58:32] [I] [TRT] Model version: 0
[01/18/2024-06:58:32] [I] [TRT] Doc string:
[01/18/2024-06:58:32] [I] [TRT] ----------------------------------------------------------------
[01/18/2024-06:58:33] [W] [TRT] onnx2trt_utils.cpp:374: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[01/18/2024-06:58:33] [W] [TRT] onnx2trt_utils.cpp:400: One or more weights outside the range of INT32 was clamped
[01/18/2024-06:58:35] [I] Finished parsing network model. Parse time: 3.2296
[01/18/2024-06:58:35] [I] [TRT] BuilderFlag::kTF32 is set but hardware does not support TF32. Disabling TF32.
[01/18/2024-06:58:40] [I] [TRT] Graph optimization time: 3.76623 seconds.
[01/18/2024-06:58:40] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2470, GPU 559 (MiB)
[01/18/2024-06:58:40] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 2470, GPU 569 (MiB)
[01/18/2024-06:58:40] [W] [TRT] TensorRT was linked against cuDNN 8.9.0 but loaded cuDNN 8.6.0
[01/18/2024-06:58:40] [I] [TRT] BuilderFlag::kTF32 is set but hardware does not support TF32. Disabling TF32.
[01/18/2024-06:58:40] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[01/18/2024-07:02:38] [I] [TRT] Detected 6 inputs and 2 output network tensors.
[01/18/2024-07:02:42] [I] [TRT] Total Host Persistent Memory: 43424
[01/18/2024-07:02:42] [I] [TRT] Total Device Persistent Memory: 475648
[01/18/2024-07:02:42] [I] [TRT] Total Scratch Memory: 780636672
[01/18/2024-07:02:42] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 50 MiB, GPU 1109 MiB
[01/18/2024-07:02:42] [I] [TRT] [BlockAssignment] Started assigning block shifts. This will take 343 steps to complete.
[01/18/2024-07:02:43] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 246.787ms to assign 54 blocks to 343 nodes requiring 942283264 bytes.
[01/18/2024-07:02:43] [I] [TRT] Total Activation Memory: 942278144
[01/18/2024-07:02:43] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 3598, GPU 1187 (MiB)
[01/18/2024-07:02:43] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 3598, GPU 1195 (MiB)
[01/18/2024-07:02:43] [W] [TRT] TensorRT was linked against cuDNN 8.9.0 but loaded cuDNN 8.6.0
[01/18/2024-07:02:44] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +502, now: CPU 0, GPU 502 (MiB)
[01/18/2024-07:02:44] [I] Engine built in 259.893 sec.
[01/18/2024-07:02:45] [I] [TRT] Loaded engine size: 513 MiB
[01/18/2024-07:02:45] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2163, GPU 935 (MiB)
[01/18/2024-07:02:45] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 2164, GPU 943 (MiB)
[01/18/2024-07:02:45] [W] [TRT] TensorRT was linked against cuDNN 8.9.0 but loaded cuDNN 8.6.0
[01/18/2024-07:02:45] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +501, now: CPU 0, GPU 501 (MiB)
[01/18/2024-07:02:45] [I] Engine deserialized in 0.493383 sec.
[01/18/2024-07:02:45] [I] [TRT] [MS] Running engine with multi stream info
[01/18/2024-07:02:45] [I] [TRT] [MS] Number of aux streams is 7
[01/18/2024-07:02:45] [I] [TRT] [MS] Number of total worker streams is 8
[01/18/2024-07:02:45] [I] [TRT] [MS] The main stream provided by execute/enqueue calls is the first worker stream
[01/18/2024-07:02:45] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2163, GPU 935 (MiB)
[01/18/2024-07:02:46] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 2164, GPU 943 (MiB)
[01/18/2024-07:02:46] [W] [TRT] TensorRT was linked against cuDNN 8.9.0 but loaded cuDNN 8.6.0
[01/18/2024-07:02:46] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +900, now: CPU 0, GPU 1401 (MiB)
[01/18/2024-07:02:46] [I] Setting persistentCacheLimit to 0 bytes.
[01/18/2024-07:02:46] [I] Using random values for input img
[01/18/2024-07:02:46] [I] Input binding for img with dimensions 1x3x800x1200 is created.
[01/18/2024-07:02:46] [I] Using random values for input bert_output
[01/18/2024-07:02:46] [I] Input binding for bert_output with dimensions 1x6x768 is created.
[01/18/2024-07:02:46] [I] Using random values for input attention_mask
[01/18/2024-07:02:46] [I] Input binding for attention_mask with dimensions 1x6 is created.
[01/18/2024-07:02:46] [I] Using random values for input position_ids
[01/18/2024-07:02:46] [I] Input binding for position_ids with dimensions 1x6 is created.
[01/18/2024-07:02:46] [I] Using random values for input text_token_mask
[01/18/2024-07:02:46] [I] Input binding for text_token_mask with dimensions 1x6x6 is created.
[01/18/2024-07:02:46] [I] Using random values for input object_mask
[01/18/2024-07:02:46] [I] Input binding for object_mask with dimensions 1x256 is created.
[01/18/2024-07:02:46] [I] Output binding for logits with dimensions 1x900x256 is created.
[01/18/2024-07:02:46] [I] Output binding for boxes with dimensions 1x900x4 is created.
[01/18/2024-07:02:46] [I] Starting inference
[01/18/2024-07:02:53] [I] Warmup completed 1 queries over 200 ms
[01/18/2024-07:02:53] [I] Timing trace has 10 queries over 6.24858 s
[01/18/2024-07:02:53] [I]
[01/18/2024-07:02:53] [I] === Trace details ===
[01/18/2024-07:02:53] [I] Trace averages of 10 runs:
[01/18/2024-07:02:53] [I] Average on 10 runs - GPU latency: 590.353 ms - Host latency: 592.789 ms (enqueue 589.294 ms)
[01/18/2024-07:02:53] [I]
[01/18/2024-07:02:53] [I] === Performance summary ===
[01/18/2024-07:02:53] [I] Throughput: 1.60036 qps
[01/18/2024-07:02:53] [I] Latency: min = 588.442 ms, max = 595.703 ms, mean = 592.789 ms, median = 592.715 ms, percentile(90%) = 595.005 ms, percentile(95%) = 595.703 ms, percentile(99%) = 595.703 ms
[01/18/2024-07:02:53] [I] Enqueue Time: min = 576.717 ms, max = 593.965 ms, mean = 589.294 ms, median = 590.206 ms, percentile(90%) = 593.607 ms, percentile(95%) = 593.965 ms, percentile(99%) = 593.965 ms
[01/18/2024-07:02:53] [I] H2D Latency: min = 2.2395 ms, max = 2.43123 ms, mean = 2.26939 ms, median = 2.25073 ms, percentile(90%) = 2.26904 ms, percentile(95%) = 2.43123 ms, percentile(99%) = 2.43123 ms
[01/18/2024-07:02:53] [I] GPU Compute Time: min = 586.03 ms, max = 593.103 ms, mean = 590.353 ms, median = 590.308 ms, percentile(90%) = 592.596 ms, percentile(95%) = 593.103 ms, percentile(99%) = 593.103 ms
[01/18/2024-07:02:53] [I] D2H Latency: min = 0.147949 ms, max = 0.171143 ms, mean = 0.166638 ms, median = 0.168518 ms, percentile(90%) = 0.170166 ms, percentile(95%) = 0.171143 ms, percentile(99%) = 0.171143 ms
[01/18/2024-07:02:53] [I] Total Host Walltime: 6.24858 s
[01/18/2024-07:02:53] [I] Total GPU Compute Time: 5.90353 s
[01/18/2024-07:02:53] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[01/18/2024-07:02:53] [W] If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[01/18/2024-07:02:53] [I] Explanations of the performance metrics are printed in the verbose logs.
[01/18/2024-07:02:53] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8601] # ./trtexec --onnx=/workspace/groundingdino.onnx --saveEngine=/workspace/groundingdino.trt --minShapes=img:1x3x800x1200,bert_output:1x1x768,attention_mask:1x1,position_ids:1x1,text_token_mask:1x1x1,object_mask:1x256 --optShapes=img:1x3x800x1200,bert_output:1x6x768,attention_mask:1x6,position_ids:1x6,text_token_mask:1x6x6,object_mask:1x256 --maxShapes=img:1x3x800x1200,bert_output:1x256x768,attention_mask:1x256,position_ids:1x256,text_token_mask:1x256x256,object_mask:1x256
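To compare runs across GPUs without copying numbers by hand, the mean latency can be pulled out of a trtexec summary line with a small stdlib script. This is just a convenience sketch; the regex assumes the TensorRT 8.6 log format shown above.

```python
import re

def parse_mean_latency(line: str) -> float:
    """Extract the mean latency in ms from a trtexec performance-summary line."""
    m = re.search(r"mean = ([\d.]+) ms", line)
    if m is None:
        raise ValueError("no 'mean = ... ms' field found in line")
    return float(m.group(1))

# Example: the GPU Compute Time line from the summary above.
line = ("[01/18/2024-07:02:53] [I] GPU Compute Time: min = 586.03 ms, "
        "max = 593.103 ms, mean = 590.353 ms, median = 590.308 ms")
print(parse_mean_latency(line))  # 590.353
```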
@nvpohanh Is this expected? (torch 650ms vs trt 590.308 ms)
The T4 is a fairly old GPU; maybe we just don't have many optimized kernels for it.
Different AI frameworks have different layer-kernel implementations on different GPU architectures.
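For context, comparing the figures quoted in this thread (a quick sketch; the 650 ms PyTorch number is as reported by the poster, not re-measured here):

```python
# Latencies quoted in the thread, in milliseconds.
torch_t4_ms = 650.0    # reported PyTorch inference time on T4
trt_t4_ms = 590.353    # mean GPU Compute Time from the trtexec summary

# Per trtexec's own measurement, TensorRT is modestly *faster* than the
# reported PyTorch time on T4 (~1.10x), far short of the ~1.5x seen on A100.
speedup_t4 = torch_t4_ms / trt_t4_ms
print(f"T4 speedup: {speedup_t4:.2f}x")
```

So the trtexec numbers above show a small win rather than a slowdown; any slowdown observed in the poster's own pipeline would have to come from outside the engine's GPU compute time (e.g. host-side enqueue overhead, which the log warns about).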
Closing since there has been no activity for more than 3 weeks, per our policy. Thanks, all!