
some questions about trtexec's batch size and workspace

EmmaThompson123 opened this issue · 4 comments

I have an ONNX model with two inputs and one output, all with dynamic shapes. I converted it to a TensorRT engine with trtexec using this command:

trtexec --onnx=weights/wav2lip/wav2lip.onnx --saveEngine=weights/wav2lip/wav2lip_2048_minoptmax.trt --workspace=2048 --minShapes=" audio_seqs__0:1x1x80x16, img_seqs__1:1x6x256x256" --optShapes="audio_seqs__0:4x1x80x16, img_seqs__1:4x6x256x256" --maxShapes="audio_seqs__0:8x1x80x16, img_seqs__1:8x6x256x256"

This is its log:

&&&& RUNNING TensorRT.trtexec [TensorRT v8003] # trtexec --onnx=weights/wav2lip/wav2lip.onnx --saveEngine=weights/wav2lip/wav2lip_2048_minoptmax.trt --workspace=2048 --minShapes=audio_seqs__0:1x1x80x16, img_seqs__1:1x6x256x256 --optShapes=audio_seqs__0:4x1x80x16, img_seqs__1:4x6x256x256 --maxShapes=audio_seqs__0:8x1x80x16, img_seqs__1:8x6x256x256
[09/19/2024-22:14:12] [I] === Model Options ===
[09/19/2024-22:14:12] [I] Format: ONNX
[09/19/2024-22:14:12] [I] Model: weights/wav2lip/wav2lip.onnx
[09/19/2024-22:14:12] [I] Output:
[09/19/2024-22:14:12] [I] === Build Options ===
[09/19/2024-22:14:12] [I] Max batch: explicit
[09/19/2024-22:14:12] [I] Workspace: 2048 MiB
[09/19/2024-22:14:12] [I] minTiming: 1
[09/19/2024-22:14:12] [I] avgTiming: 8
[09/19/2024-22:14:12] [I] Precision: FP32
[09/19/2024-22:14:12] [I] Calibration: 
[09/19/2024-22:14:12] [I] Refit: Disabled
[09/19/2024-22:14:12] [I] Sparsity: Disabled
[09/19/2024-22:14:12] [I] Safe mode: Disabled
[09/19/2024-22:14:12] [I] Restricted mode: Disabled
[09/19/2024-22:14:12] [I] Save engine: weights/wav2lip/wav2lip_2048_minoptmax.trt
[09/19/2024-22:14:12] [I] Load engine: 
[09/19/2024-22:14:12] [I] NVTX verbosity: 0
[09/19/2024-22:14:12] [I] Tactic sources: Using default tactic sources
[09/19/2024-22:14:12] [I] timingCacheMode: local
[09/19/2024-22:14:12] [I] timingCacheFile: 
[09/19/2024-22:14:12] [I] Input(s)s format: fp32:CHW
[09/19/2024-22:14:12] [I] Output(s)s format: fp32:CHW
[09/19/2024-22:14:12] [I] Input build shape:  img_seqs__1=1x6x256x256+4x6x256x256+8x6x256x256
[09/19/2024-22:14:12] [I] Input build shape: audio_seqs__0=1x1x80x16+4x1x80x16+8x1x80x16
[09/19/2024-22:14:12] [I] Input calibration shapes: model
[09/19/2024-22:14:12] [I] === System Options ===
[09/19/2024-22:14:12] [I] Device: 0
[09/19/2024-22:14:12] [I] DLACore: 
[09/19/2024-22:14:12] [I] Plugins:
[09/19/2024-22:14:12] [I] === Inference Options ===
[09/19/2024-22:14:12] [I] Batch: Explicit
[09/19/2024-22:14:12] [I] Input inference shape: audio_seqs__0=4x1x80x16
[09/19/2024-22:14:12] [I] Input inference shape:  img_seqs__1=4x6x256x256
[09/19/2024-22:14:12] [I] Iterations: 10
[09/19/2024-22:14:12] [I] Duration: 3s (+ 200ms warm up)
[09/19/2024-22:14:12] [I] Sleep time: 0ms
[09/19/2024-22:14:12] [I] Streams: 1
[09/19/2024-22:14:12] [I] ExposeDMA: Disabled
[09/19/2024-22:14:12] [I] Data transfers: Enabled
[09/19/2024-22:14:12] [I] Spin-wait: Disabled
[09/19/2024-22:14:12] [I] Multithreading: Disabled
[09/19/2024-22:14:12] [I] CUDA Graph: Disabled
[09/19/2024-22:14:12] [I] Separate profiling: Disabled
[09/19/2024-22:14:12] [I] Time Deserialize: Disabled
[09/19/2024-22:14:12] [I] Time Refit: Disabled
[09/19/2024-22:14:12] [I] Skip inference: Disabled
[09/19/2024-22:14:12] [I] Inputs:
[09/19/2024-22:14:12] [I] === Reporting Options ===
[09/19/2024-22:14:12] [I] Verbose: Disabled
[09/19/2024-22:14:12] [I] Averages: 10 inferences
[09/19/2024-22:14:12] [I] Percentile: 99
[09/19/2024-22:14:12] [I] Dump refittable layers:Disabled
[09/19/2024-22:14:12] [I] Dump output: Disabled
[09/19/2024-22:14:12] [I] Profile: Disabled
[09/19/2024-22:14:12] [I] Export timing to JSON file: 
[09/19/2024-22:14:12] [I] Export output to JSON file: 
[09/19/2024-22:14:12] [I] Export profile to JSON file: 
[09/19/2024-22:14:12] [I] 
[09/19/2024-22:14:12] [I] === Device Information ===
[09/19/2024-22:14:12] [I] Selected Device: NVIDIA A30
[09/19/2024-22:14:12] [I] Compute Capability: 8.0
[09/19/2024-22:14:12] [I] SMs: 56
[09/19/2024-22:14:12] [I] Compute Clock Rate: 1.44 GHz
[09/19/2024-22:14:12] [I] Device Global Memory: 24060 MiB
[09/19/2024-22:14:12] [I] Shared Memory per SM: 164 KiB
[09/19/2024-22:14:12] [I] Memory Bus Width: 3072 bits (ECC enabled)
[09/19/2024-22:14:12] [I] Memory Clock Rate: 1.215 GHz
[09/19/2024-22:14:12] [I] 
[09/19/2024-22:14:12] [I] TensorRT version: 8003
[09/19/2024-22:14:14] [I] [TRT] [MemUsageChange] Init CUDA: CPU +502, GPU +0, now: CPU 509, GPU 521 (MiB)
[09/19/2024-22:14:14] [I] Start parsing network model
[09/19/2024-22:14:15] [I] [TRT] ----------------------------------------------------------------
[09/19/2024-22:14:15] [I] [TRT] Input filename:   weights/wav2lip/wav2lip.onnx
[09/19/2024-22:14:15] [I] [TRT] ONNX IR version:  0.0.7
[09/19/2024-22:14:15] [I] [TRT] Opset version:    9
[09/19/2024-22:14:15] [I] [TRT] Producer name:    pytorch
[09/19/2024-22:14:15] [I] [TRT] Producer version: 1.10
[09/19/2024-22:14:15] [I] [TRT] Domain:           
[09/19/2024-22:14:15] [I] [TRT] Model version:    0
[09/19/2024-22:14:15] [I] [TRT] Doc string:       
[09/19/2024-22:14:15] [I] [TRT] ----------------------------------------------------------------
[09/19/2024-22:14:15] [I] Finish parsing network model
[09/19/2024-22:14:15] [I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 714, GPU 521 (MiB)
[09/19/2024-22:14:15] [W] Dynamic dimensions required for input: img_seqs__1, but no shapes were provided. Automatically overriding shape to: 1x6x256x256
[09/19/2024-22:14:15] [I] [TRT] [MemUsageSnapshot] Builder begin: CPU 714 MiB, GPU 521 MiB
[09/19/2024-22:14:19] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +793, GPU +342, now: CPU 1592, GPU 863 (MiB)
[09/19/2024-22:14:24] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +198, GPU +342, now: CPU 1790, GPU 1205 (MiB)
[09/19/2024-22:14:24] [W] [TRT] Detected invalid timing cache, setup a local cache instead
[09/19/2024-22:15:03] [I] [TRT] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[09/19/2024-22:15:33] [I] [TRT] Detected 2 inputs and 1 output network tensors.
[09/19/2024-22:15:34] [I] [TRT] Total Host Persistent Memory: 170400
[09/19/2024-22:15:34] [I] [TRT] Total Device Persistent Memory: 125895680
[09/19/2024-22:15:34] [I] [TRT] Total Scratch Memory: 59641344
[09/19/2024-22:15:34] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 116 MiB, GPU 4 MiB
[09/19/2024-22:15:34] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2493, GPU 1807 (MiB)
[09/19/2024-22:15:34] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 2493, GPU 1815 (MiB)
[09/19/2024-22:15:34] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 2493, GPU 1799 (MiB)
[09/19/2024-22:15:34] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 2493, GPU 1781 (MiB)
[09/19/2024-22:15:34] [I] [TRT] [MemUsageSnapshot] Builder end: CPU 2493 MiB, GPU 1781 MiB
[09/19/2024-22:15:35] [I] [TRT] Loaded engine size: 245 MB
[09/19/2024-22:15:35] [I] [TRT] [MemUsageSnapshot] deserializeCudaEngine begin: CPU 2653 MiB, GPU 1535 MiB
[09/19/2024-22:15:36] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 2654, GPU 1791 (MiB)
[09/19/2024-22:15:36] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 2654, GPU 1799 (MiB)
[09/19/2024-22:15:36] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 2654, GPU 1781 (MiB)
[09/19/2024-22:15:36] [I] [TRT] [MemUsageSnapshot] deserializeCudaEngine end: CPU 2654 MiB, GPU 1781 MiB
[09/19/2024-22:15:38] [I] Engine built in 86.0558 sec.
[09/19/2024-22:15:38] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation begin: CPU 2202 MiB, GPU 1781 MiB
[09/19/2024-22:15:38] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 2202, GPU 1791 (MiB)
[09/19/2024-22:15:38] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 2203, GPU 1799 (MiB)
[09/19/2024-22:15:38] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation end: CPU 2203 MiB, GPU 2029 MiB
[09/19/2024-22:15:38] [I] Created input binding for audio_seqs__0 with dimensions 1x1x80x16
[09/19/2024-22:15:38] [I] Created input binding for img_seqs__1 with dimensions 1x6x256x256
[09/19/2024-22:15:38] [I] Created output binding for value__0 with dimensions 1x3x256x256
[09/19/2024-22:15:38] [I] Starting inference
[09/19/2024-22:15:41] [I] Warmup completed 25 queries over 200 ms
[09/19/2024-22:15:41] [I] Timing trace has 437 queries over 3.01541 s
[09/19/2024-22:15:41] [I] 
[09/19/2024-22:15:41] [I] === Trace details ===
[09/19/2024-22:15:41] [I] Trace averages of 10 runs:
[09/19/2024-22:15:41] [I] Average on 10 runs - GPU latency: 6.14592 ms - Host latency: 6.26049 ms (end to end 12.2136 ms, enqueue 1.56614 ms)
[09/19/2024-22:15:41] [I] Average on 10 runs - GPU latency: 6.10812 ms - Host latency: 6.22248 ms (end to end 12.1529 ms, enqueue 1.66944 ms)
[09/19/2024-22:15:41] [I] Average on 10 runs - GPU latency: 6.07657 ms - Host latency: 6.19256 ms (end to end 11.9598 ms, enqueue 1.79318 ms)
[09/19/2024-22:15:41] [I] Average on 10 runs - GPU latency: 6.10353 ms - Host latency: 6.21867 ms (end to end 12.1297 ms, enqueue 1.52819 ms)
[09/19/2024-22:15:41] [I] Average on 10 runs - GPU latency: 6.14221 ms - Host latency: 6.25702 ms (end to end 12.1781 ms, enqueue 1.65053 ms)
[09/19/2024-22:15:41] [I] Average on 10 runs - GPU latency: 6.126 ms - Host latency: 6.2387 ms (end to end 12.1926 ms, enqueue 1.53453 ms)
[09/19/2024-22:15:41] [I] Average on 10 runs - GPU latency: 6.07675 ms - Host latency: 6.18932 ms (end to end 12.0923 ms, enqueue 1.51494 ms)
[09/19/2024-22:15:41] [I] Average on 10 runs - GPU latency: 6.07515 ms - Host latency: 6.18792 ms (end to end 12.092 ms, enqueue 1.62488 ms)
[09/19/2024-22:15:41] [I] Average on 10 runs - GPU latency: 6.10464 ms - Host latency: 6.2211 ms (end to end 10.2946 ms, enqueue 1.6262 ms)
[09/19/2024-22:15:41] [I] Average on 10 runs - GPU latency: 6.07413 ms - Host latency: 6.18934 ms (end to end 12.0712 ms, enqueue 1.57864 ms)
[09/19/2024-22:15:41] [I] Average on 10 runs - GPU latency: 6.07213 ms - Host latency: 6.18408 ms (end to end 12.0721 ms, enqueue 1.50838 ms)
[09/19/2024-22:15:41] [I] Average on 10 runs - GPU latency: 6.07623 ms - Host latency: 6.18799 ms (end to end 12.0848 ms, enqueue 1.4942 ms)
[09/19/2024-22:15:41] [I] Average on 10 runs - GPU latency: 6.07281 ms - Host latency: 6.18541 ms (end to end 12.0798 ms, enqueue 1.55894 ms)
[09/19/2024-22:15:41] [I] Average on 10 runs - GPU latency: 6.0995 ms - Host latency: 6.21367 ms (end to end 12.119 ms, enqueue 1.55039 ms)
[09/19/2024-22:15:41] [I] Average on 10 runs - GPU latency: 6.13557 ms - Host latency: 6.25006 ms (end to end 12.2051 ms, enqueue 1.70592 ms)
[09/19/2024-22:15:41] [I] Average on 10 runs - GPU latency: 6.08328 ms - Host latency: 6.19482 ms (end to end 12.108 ms, enqueue 1.50658 ms)
[09/19/2024-22:15:41] [I] Average on 10 runs - GPU latency: 6.07456 ms - Host latency: 6.19028 ms (end to end 12.073 ms, enqueue 1.63972 ms)
[09/19/2024-22:15:41] [I] Average on 10 runs - GPU latency: 6.08335 ms - Host latency: 6.20088 ms (end to end 12.0745 ms, enqueue 2.13264 ms)
[09/19/2024-22:15:41] [I] Average on 10 runs - GPU latency: 6.14294 ms - Host latency: 6.26263 ms (end to end 12.1738 ms, enqueue 1.70513 ms)
[09/19/2024-22:15:41] [I] Average on 10 runs - GPU latency: 6.09976 ms - Host latency: 6.23215 ms (end to end 10.6229 ms, enqueue 2.17544 ms)
[09/19/2024-22:15:41] [I] Average on 10 runs - GPU latency: 6.12134 ms - Host latency: 6.24631 ms (end to end 11.0738 ms, enqueue 1.90903 ms)
[09/19/2024-22:15:41] [I] Average on 10 runs - GPU latency: 6.11261 ms - Host latency: 6.22388 ms (end to end 11.1616 ms, enqueue 0.869983 ms)
[09/19/2024-22:15:41] [I] Average on 10 runs - GPU latency: 6.13969 ms - Host latency: 6.25413 ms (end to end 10.737 ms, enqueue 1.2068 ms)
[09/19/2024-22:15:41] [I] Average on 10 runs - GPU latency: 6.11079 ms - Host latency: 6.22406 ms (end to end 11.5116 ms, enqueue 1.17863 ms)
[09/19/2024-22:15:41] [I] Average on 10 runs - GPU latency: 6.09539 ms - Host latency: 6.20665 ms (end to end 10.6648 ms, enqueue 1.31162 ms)
[09/19/2024-22:15:41] [I] Average on 10 runs - GPU latency: 6.14495 ms - Host latency: 6.25798 ms (end to end 11.9712 ms, enqueue 1.41851 ms)
[09/19/2024-22:15:41] [I] Average on 10 runs - GPU latency: 6.14591 ms - Host latency: 6.26588 ms (end to end 9.82986 ms, enqueue 2.21858 ms)
[09/19/2024-22:15:41] [I] Average on 10 runs - GPU latency: 6.10359 ms - Host latency: 6.21925 ms (end to end 9.82765 ms, enqueue 1.67719 ms)
[09/19/2024-22:15:41] [I] Average on 10 runs - GPU latency: 6.07402 ms - Host latency: 6.18718 ms (end to end 11.9682 ms, enqueue 1.5032 ms)
[09/19/2024-22:15:41] [I] Average on 10 runs - GPU latency: 6.09145 ms - Host latency: 6.20684 ms (end to end 12.0853 ms, enqueue 1.93337 ms)
[09/19/2024-22:15:41] [I] Average on 10 runs - GPU latency: 6.14033 ms - Host latency: 6.25259 ms (end to end 12.2141 ms, enqueue 1.51523 ms)
[09/19/2024-22:15:41] [I] Average on 10 runs - GPU latency: 6.14177 ms - Host latency: 6.25718 ms (end to end 10.9604 ms, enqueue 1.66836 ms)
[09/19/2024-22:15:41] [I] Average on 10 runs - GPU latency: 6.0825 ms - Host latency: 6.19624 ms (end to end 12.107 ms, enqueue 1.57991 ms)
[09/19/2024-22:15:41] [I] Average on 10 runs - GPU latency: 6.12781 ms - Host latency: 6.24231 ms (end to end 12.1774 ms, enqueue 1.67471 ms)
[09/19/2024-22:15:41] [I] Average on 10 runs - GPU latency: 6.13193 ms - Host latency: 6.24639 ms (end to end 12.1468 ms, enqueue 1.6769 ms)
[09/19/2024-22:15:41] [I] Average on 10 runs - GPU latency: 6.12966 ms - Host latency: 6.24275 ms (end to end 11.5934 ms, enqueue 1.69204 ms)
[09/19/2024-22:15:41] [I] Average on 10 runs - GPU latency: 6.13918 ms - Host latency: 6.25156 ms (end to end 11.401 ms, enqueue 1.52246 ms)
[09/19/2024-22:15:41] [I] Average on 10 runs - GPU latency: 6.13882 ms - Host latency: 6.25691 ms (end to end 12.1839 ms, enqueue 2.05549 ms)
[09/19/2024-22:15:41] [I] Average on 10 runs - GPU latency: 6.13804 ms - Host latency: 6.25203 ms (end to end 12.1769 ms, enqueue 1.42085 ms)
[09/19/2024-22:15:41] [I] Average on 10 runs - GPU latency: 6.13926 ms - Host latency: 6.25642 ms (end to end 12.1934 ms, enqueue 1.71943 ms)
[09/19/2024-22:15:41] [I] Average on 10 runs - GPU latency: 6.13928 ms - Host latency: 6.25256 ms (end to end 12.1998 ms, enqueue 1.67085 ms)
[09/19/2024-22:15:41] [I] Average on 10 runs - GPU latency: 6.14385 ms - Host latency: 6.25491 ms (end to end 12.1301 ms, enqueue 1.40862 ms)
[09/19/2024-22:15:41] [I] Average on 10 runs - GPU latency: 6.13625 ms - Host latency: 6.24854 ms (end to end 12.1949 ms, enqueue 1.58989 ms)
[09/19/2024-22:15:41] [I] 
[09/19/2024-22:15:41] [I] === Performance summary ===
[09/19/2024-22:15:41] [I] Throughput: 144.922 qps
[09/19/2024-22:15:41] [I] Latency: min = 6.17139 ms, max = 6.48535 ms, mean = 6.22789 ms, median = 6.24438 ms, percentile(99%) = 6.34253 ms
[09/19/2024-22:15:41] [I] End-to-End Host Latency: min = 6.20459 ms, max = 12.3419 ms, mean = 11.763 ms, median = 12.0958 ms, percentile(99%) = 12.2358 ms
[09/19/2024-22:15:41] [I] Enqueue Time: min = 0.611816 ms, max = 5.16602 ms, mean = 1.61378 ms, median = 1.51978 ms, percentile(99%) = 2.65259 ms
[09/19/2024-22:15:41] [I] H2D Latency: min = 0.0717773 ms, max = 0.231445 ms, mean = 0.0787646 ms, median = 0.076416 ms, percentile(99%) = 0.0963287 ms
[09/19/2024-22:15:41] [I] GPU Compute Time: min = 6.05969 ms, max = 6.35608 ms, mean = 6.11304 ms, median = 6.13123 ms, percentile(99%) = 6.21082 ms
[09/19/2024-22:15:41] [I] D2H Latency: min = 0.0345459 ms, max = 0.0378418 ms, mean = 0.0360786 ms, median = 0.0358887 ms, percentile(99%) = 0.0376587 ms
[09/19/2024-22:15:41] [I] Total Host Walltime: 3.01541 s
[09/19/2024-22:15:41] [I] Total GPU Compute Time: 2.6714 s
[09/19/2024-22:15:41] [I] Explanations of the performance metrics are printed in the verbose logs.
[09/19/2024-22:15:41] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8003] # trtexec --onnx=weights/wav2lip/wav2lip.onnx --saveEngine=weights/wav2lip/wav2lip_2048_minoptmax.trt --workspace=2048 --minShapes=audio_seqs__0:1x1x80x16, img_seqs__1:1x6x256x256 --optShapes=audio_seqs__0:4x1x80x16, img_seqs__1:4x6x256x256 --maxShapes=audio_seqs__0:8x1x80x16, img_seqs__1:8x6x256x256
[09/19/2024-22:15:41] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 2202, GPU 1889 (MiB)

My questions are:

  1. Why does the log still say "Dynamic dimensions required for input: img_seqs__1, but no shapes were provided. Automatically overriding shape to: 1x6x256x256" even though I set --minShapes, --optShapes, and --maxShapes? Also, when I print engine.max_batch_size it outputs 1; shouldn't the max batch size be 8? How can I make engine.max_batch_size equal 8?
  2. How should the workspace be set properly? I don't want to set it too large, because I want to run as many model instances as possible on a single GPU to reduce inference costs in high-concurrency scenarios; on the other hand, if the workspace is too small, inference will be slower. That was my initial understanding. Can we get some clues from the MemUsageChange lines? If so, based on the last line of the log, peak memory usage reached 1889 MiB, so I would guess the proper workspace is a little larger than that, e.g. 2048 MiB. However, I have also set the workspace to 4096, which is greater than 1889, and I still see the message "Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output". So even with a larger workspace setting, there are still warnings about insufficient memory. Additionally, even when the workspace is set to 1024, GPU memory monitoring during inference still shows usage around 1800 MiB, which suggests that a smaller workspace does not limit the maximum GPU memory usage.
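As a rough sanity check of question 2, the log above already accounts for most of that ~1800 MiB outside the workspace: the workspace only bounds per-layer scratch tactics, not engine weights, persistent activation memory, or the CUDA context itself. A back-of-the-envelope sum (numbers taken from the log; the CUDA-context figure is an approximation read off the "Init CUDA" line):

```python
# Device memory that exists *outside* the workspace limit,
# using figures from the trtexec log above.
MiB = 1 << 20

engine_weights    = 245 * MiB    # "Loaded engine size: 245 MB"
persistent_device = 125895680    # "Total Device Persistent Memory"
scratch           = 59641344     # "Total Scratch Memory"
cuda_context      = 521 * MiB    # approx.: GPU usage right after "Init CUDA"

total = engine_weights + persistent_device + scratch + cuda_context
print(round(total / MiB))  # -> 943 (MiB), before cuBLAS/cuDNN handles are added
```

The cuBLAS/cuDNN library handles (several hundred MiB each, per the MemUsageChange lines) make up most of the remaining gap, which is why shrinking --workspace barely moves the observed footprint.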

EmmaThompson123 avatar Sep 20 '24 07:09 EmmaThompson123

Could you try a newer TRT version? I believe issue (1) has been fixed in the latest TRT version. Thanks

nvpohanh avatar Sep 20 '24 08:09 nvpohanh

Could you try a newer TRT version? I believe issue (1) has been fixed in the latest TRT version. Thanks

OK, issue (1) is not a big problem; there is a workaround, which is to manually set max_batch_size inside def allocate_buffers(engine): ~
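For readers hitting the same thing, a minimal sketch of that workaround: size the buffers from the optimization profile's max shapes instead of trusting engine.max_batch_size (which reports 1 for explicit-batch engines on TRT 8.0). This assumes pycuda and the TRT 8.x Python bindings; the output-shape handling is a simplification (dynamic dims appear as -1 and are replaced with the profile's max batch):

```python
import math

def buffer_nbytes(shape, itemsize=4):
    """Bytes needed for one binding sized at its max shape (fp32 -> itemsize=4)."""
    return math.prod(shape) * itemsize

def allocate_buffers(engine, profile_index=0):
    """Sketch: allocate device buffers from profile max shapes, not max_batch_size."""
    import pycuda.driver as cuda  # assumed runtime dependency

    # Derive the max batch from the profile of the first input binding.
    first_input = next(b for b in engine
                       if engine.binding_is_input(engine.get_binding_index(b)))
    max_batch = engine.get_profile_shape(profile_index, first_input)[2][0]

    buffers = {}
    for binding in engine:
        idx = engine.get_binding_index(binding)
        if engine.binding_is_input(idx):
            # get_profile_shape returns [min, opt, max]; take max
            shape = engine.get_profile_shape(profile_index, binding)[2]
        else:
            # dynamic dims show up as -1; substitute the profile's max batch
            shape = [max_batch if d == -1 else d
                     for d in engine.get_binding_shape(idx)]
        buffers[binding] = cuda.mem_alloc(buffer_nbytes(shape))
    return buffers
```

With the shapes from this issue, the img_seqs__1 buffer at max shape 8x6x256x256 in fp32 would be 12 MiB rather than the 1.5 MiB a batch-1 sizing gives.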

EmmaThompson123 avatar Sep 20 '24 10:09 EmmaThompson123

This interface has many properties that you can set to control how TensorRT optimizes the network. One important property is the maximum workspace size. Layer implementations often require a temporary workspace, and this parameter limits the maximum size that any layer in the network can use. If insufficient workspace is provided, it is possible that TensorRT will not be able to find an implementation for a layer. By default, the workspace is set to the total global memory size of the given device; restrict it when necessary, for example, when multiple engines are to be built on a single device.

During the build, TensorRT allocates device memory for timing layer implementations. Some implementations can consume a lot of temporary memory, especially with large tensors. You can control the maximum amount of temporary memory through the memory pool limits of the builder config. The workspace size defaults to the full size of the device's global memory but can be restricted when necessary. If the builder finds applicable kernels that could not be run because of insufficient workspace, it will emit a logging message indicating this. Even with relatively little workspace, however, timing requires creating buffers for input, output, and weights. TensorRT is robust against the operating system (OS) returning out-of-memory for such allocations. On some platforms, the OS may successfully provide memory, and then the out-of-memory killer process observes that the system is low on memory and kills TensorRT. If this happens, free up as much system memory as possible before retrying.

ref https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#build_engine_c
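For reference, this is how the quoted workspace setting maps onto the Python builder API. A sketch only: it assumes tensorrt is importable, and the hasattr check is my assumption for covering both API generations (max_workspace_size on TRT 8.0.x, the memory-pool limit on TRT 8.4+):

```python
def mib(n):
    """Convert MiB to bytes, matching trtexec's --workspace unit."""
    return n << 20

def make_config(builder, workspace_mib=2048):
    """Sketch: create a builder config with a bounded workspace."""
    import tensorrt as trt  # assumed available at runtime
    config = builder.create_builder_config()
    if hasattr(config, "set_memory_pool_limit"):   # TRT >= 8.4
        config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE,
                                     mib(workspace_mib))
    else:                                          # TRT 8.0.x
        config.max_workspace_size = mib(workspace_mib)
    return config
```

Note this bounds only the per-layer scratch pool described in the docs above; weights, persistent activations, and the CUDA/cuBLAS/cuDNN context memory are allocated separately.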

lix19937 avatar Oct 05 '24 13:10 lix19937

How can I make the workspace setting truly take effect, i.e. strictly limit the maximum GPU memory usage to below the workspace value? I want to ensure that at any time, in any scenario, GPU memory usage never exceeds the workspace value; otherwise it may lead to OOM and crash the cloud service.

EmmaThompson123 avatar Oct 18 '24 11:10 EmmaThompson123