What determines whether the TRT inference plan picks a sparse implementation for a layer?
Description
I used apex's ASP to apply N:M structured sparsity to my model. I get the following output when I run ./trtexec --onnx=sparse.onnx --saveEngine=sparse.trt --sparsity=enable --fp16 --verbose to generate the engine:
[08/05/2024-12:06:54] [I] [TRT] (Sparsity) Layers eligible for sparse math: MatMul_401, Conv_619 + Relu_620, Conv_622 + Relu_623, Conv_235 + Relu_236, Conv_626, Conv_239, Conv_621 + Add_627 + Relu_628, Conv_234 + Add_240 + Relu_241, Conv_243 + Relu_244, Conv_630 + Relu_631, Conv_247, Conv_634, Conv_242 + Add_248 + Relu_249, Conv_629 + Add_635 + Relu_636, Conv_637 + Relu_638, Conv_250 + Relu_251, Conv_641 + Add_642 + Relu_643, Conv_254 + Add_255 + Relu_256, Conv_258 + Relu_259, Conv_644 + Relu_645, Conv_262, Conv_257 + Add_263 + Relu_264, Conv_648 + Add_649 + Relu_650, Conv_652 + Relu_653, Conv_265 + Relu_266, Conv_656, Conv_651 + Add_657 + Relu_658, Conv_269 + Add_270 + Relu_271, Conv_272 + Relu_273, Conv_659 + Relu_660, Conv_276 + Add_277 + Relu_278, Conv_663 + Add_664 + Relu_665, Conv_666 + Relu_667, Conv_279 + Relu_280, Conv_670 + Add_671 + Relu_672, Conv_283 + Add_284 + Relu_285, Conv_286 + Relu_287, Conv_673 + Relu_674, Conv_290 + Add_291 + Relu_292, Conv_677 + Add_678 + Relu_679, Conv_680 + Relu_681, Conv_293 + Relu_294, Conv_684 + Add_685 + Relu_686, Conv_297 + Add_298 + Relu_299, Conv_300 + Relu_301, Conv_687 + Relu_688, Conv_304 + Add_305 + Relu_306, Conv_691 + Add_692 + Relu_693, Conv_694 + Relu_695, Conv_308 + Relu_309, Conv_312, Conv_698 + Add_699 + Relu_700, Conv_307 + Add_313 + Relu_314, Conv_315 + Relu_316, Conv_702 + Relu_703, Conv_706, Conv_319 + Add_320 + Relu_321, Conv_701 + Add_707 + Relu_708, Conv_709 + Relu_710, Conv_322 + Relu_323, Conv_713 + Add_714 + Relu_715, Conv_326 + Add_327 + Relu_328, Conv_329 + Relu_330, Conv_716 + Relu_717, Conv_333 + Add_334 + Relu_335, Conv_720 + Add_721 + Relu_722, Conv_723 + Relu_724, Conv_336 + Relu_337, Conv_727 + Add_728 + Relu_729, Conv_340 + Add_341 + Relu_342, Conv_343 + Relu_344, Conv_730 + Relu_731, Conv_347 + Add_348 + Relu_349, Conv_734 + Add_735 + Relu_736, Conv_737, Conv_350 + Relu_351, Conv_739 + Add_740, Conv_742 + Add_743, Conv_354 + Add_355 + Relu_356, Conv_357 + Relu_358, Conv_361 + Add_362 + 
Relu_363, Conv_364 + Relu_365, Conv_368 + Add_369 + Relu_370, Conv_371 + Relu_372, Conv_375 + Add_376 + Relu_377, Conv_378 + Relu_379, Conv_382 + Add_383 + Relu_384, Conv_385 + Relu_386, Conv_389 + Add_390 + Relu_391, Conv_392, Conv_394 + Add_395, Conv_397 + Add_398, Conv_472 || Conv_443 || Conv_438, MatMul_514, MatMul_513, MatMul_592, Conv_607 + Relu_608, Conv_609 + Add_610, Conv_615, Conv_612 + Relu_613, Conv_614 + Add_617 + Relu_618, Conv_752 + Relu_753, Conv_754, Conv_823 || Conv_813 || Conv_809 || Conv_799 || Conv_789 || Conv_785 || Conv_775 || Conv_765, Conv_761 || Conv_825 || Conv_821 || Conv_819 || Conv_817 || Conv_815 || Conv_811 || Conv_807, Conv_805 || Conv_803 || Conv_801 || Conv_797 || Conv_795 || Conv_793 || Conv_791 || Conv_787, Conv_783 || Conv_781 || Conv_779 || Conv_777 || Conv_773 || Conv_771 || Conv_769 || Conv_767, Conv_763 || Conv_759 || Conv_757 || Conv_755
[08/05/2024-12:06:54] [I] [TRT] (Sparsity) TRT inference plan picked sparse implementation for layers: Conv_626, Conv_243 + Relu_244, Conv_630 + Relu_631, Conv_247, Conv_634, Conv_637 + Relu_638, Conv_250 + Relu_251, Conv_254 + Add_255 + Relu_256, Conv_258 + Relu_259, Conv_644 + Relu_645, Conv_262, Conv_257 + Add_263 + Relu_264, Conv_652 + Relu_653, Conv_265 + Relu_266, Conv_656, Conv_269 + Add_270 + Relu_271, Conv_272 + Relu_273, Conv_659 + Relu_660, Conv_276 + Add_277 + Relu_278, Conv_666 + Relu_667, Conv_279 + Relu_280, Conv_283 + Add_284 + Relu_285, Conv_286 + Relu_287, Conv_673 + Relu_674, Conv_290 + Add_291 + Relu_292, Conv_680 + Relu_681, Conv_293 + Relu_294, Conv_297 + Add_298 + Relu_299, Conv_300 + Relu_301, Conv_687 + Relu_688, Conv_304 + Add_305 + Relu_306, Conv_694 + Relu_695, Conv_308 + Relu_309, Conv_312, Conv_307 + Add_313 + Relu_314, Conv_315 + Relu_316, Conv_702 + Relu_703, Conv_706, Conv_319 + Add_320 + Relu_321, Conv_709 + Relu_710, Conv_322 + Relu_323, Conv_326 + Add_327 + Relu_328, Conv_329 + Relu_330, Conv_716 + Relu_717, Conv_333 + Add_334 + Relu_335, Conv_723 + Relu_724, Conv_336 + Relu_337, Conv_340 + Add_341 + Relu_342, Conv_343 + Relu_344, Conv_730 + Relu_731, Conv_347 + Add_348 + Relu_349, Conv_737, Conv_350 + Relu_351, Conv_354 + Add_355 + Relu_356, Conv_357 + Relu_358, Conv_361 + Add_362 + Relu_363, Conv_364 + Relu_365, Conv_368 + Add_369 + Relu_370, Conv_371 + Relu_372, Conv_375 + Add_376 + Relu_377, Conv_378 + Relu_379, Conv_382 + Add_383 + Relu_384, Conv_385 + Relu_386, Conv_389 + Add_390 + Relu_391, Conv_392, Conv_394 + Add_395, Conv_397 + Add_398, MatMul_514, MatMul_513, MatMul_592, Conv_607 + Relu_608, Conv_609 + Add_610, Conv_615, Conv_752 + Relu_753, Conv_823 || Conv_813 || Conv_809 || Conv_799 || Conv_789 || Conv_785 || Conv_775 || Conv_765, Conv_761 || Conv_825 || Conv_821 || Conv_819 || Conv_817 || Conv_815 || Conv_811 || Conv_807, Conv_805 || Conv_803 || Conv_801 || Conv_797 || Conv_795 || Conv_793 || Conv_791 || Conv_787, 
Conv_783 || Conv_781 || Conv_779 || Conv_777 || Conv_773 || Conv_771 || Conv_769 || Conv_767, Conv_763 || Conv_759 || Conv_757 || Conv_755
I compared the layers that are eligible for sparse math against the layers for which the TRT inference plan actually picked a sparse implementation, and marked the ones that were eligible but not picked (the red entries in the picture below).
I saw an answer in a previous issue saying that convolutional layers with few channels or a small kernel size will not use a sparse implementation.
However, I observed the opposite: many convolutional layers in my model with relatively few channels did pick the sparse implementation, while many convolutional layers with more channels did not.
For example, Conv_663 + Add_664 + Relu_665, whose Conv layer has weight shape [288,288,1,1], did not pick the sparse implementation, while Conv_276 + Add_277 + Relu_278, whose Conv layer has weight shape [160,160,1,1], did.
So is there any other factor that affects whether the sparse implementation is picked?
Environment
TensorRT Version: 8.5.2.2
NVIDIA GPU: Orin
Operating System: Linux
Python Version (if applicable): 3.8.10
Steps To Reproduce
Commands or scripts:
./trtexec --onnx=sparse.onnx --saveEngine=sparse.trt --sparsity=enable --fp16 --verbose
During ASP training, many structured-sparse layers that meet the mathematical (2:4) criteria are produced, but when TRT builds the engine it searches over concrete kernel implementations and keeps the strategy with the lowest overall time.
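For reference, the 2:4 pattern that ASP enforces can be checked with a small NumPy sketch (my own illustration, not apex code): in every group of 4 consecutive weights along the input-channel axis, at most 2 are nonzero.

```python
import numpy as np

def is_2to4_sparse(weight: np.ndarray) -> bool:
    """Check whether a [out_ch, in_ch] weight matrix satisfies 2:4
    structured sparsity: in every group of 4 consecutive values along
    the input-channel axis, at most 2 are nonzero."""
    out_ch, in_ch = weight.shape
    assert in_ch % 4 == 0, "in_ch must be divisible by 4"
    groups = weight.reshape(out_ch, in_ch // 4, 4)
    nonzeros_per_group = np.count_nonzero(groups, axis=2)
    return bool((nonzeros_per_group <= 2).all())

# Dense weights fail the check; pruning the 2 smallest-magnitude
# entries of every group of 4 (roughly what ASP does) makes it pass.
w = np.random.randn(8, 16)
groups = w.reshape(8, 4, 4)
idx = np.argsort(np.abs(groups), axis=2)[:, :, :2]
np.put_along_axis(groups, idx, 0.0, axis=2)
pruned = groups.reshape(8, 16)
print(is_2to4_sparse(pruned))  # True
```

Satisfying this pattern only makes a layer *eligible*; it does not force the builder to use a sparse kernel.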
As you say, Conv_663 + Add_664 + Relu_665 is eligible for sparse math but was not picked by TRT, while Conv_276 + Add_277 + Relu_278 is eligible and was picked.
Conv_663 + Add_664 + Relu_665 is a fused layer whose Conv has weight shape [288,288,1,1]; Conv_276 + Add_277 + Relu_278 is a fused layer whose Conv has weight shape [160,160,1,1]. Two things differ between them:
- the two convs have different in/out channel counts
- the input/output format of Conv_663 + Add_664 + Relu_665 may differ from the input/output format of Conv_276 + Add_277 + Relu_278 (not shown in your description)
Based on the above, TRT times the available tactics (both dense and sparse) for each format combination and chooses the fastest one, e.g.:
*************** Autotuning format combination: Half(256000,1000,50,1) -> Half(256000,1000,50,1) ***************
--------------- Timing Runner: /fpe/conv_reduce/Conv + /fpe/act1/Relu (CudnnConvolution)
Tactic: 0x0000000000000000 Time: 1.12616
Tactic: 0x0000000000000001 Time: 0.909769
Tactic: 0x0000000000000002 Time: 1.20569
Tactic: 0x0000000000000004 Time: 56.6324
Tactic: 0x0000000000000005 Time: 2.21428
Tactic: 0x0000000000000038 Time: 1.12753
Tactic: 0x000000000000003a Time: 1.19835
Tactic: 0x000000000000003c Time: 56.6038
Tactic: 0x000000000000003d Time: 2.2058
Fastest Tactic: 0x0000000000000001 Time: 0.909769
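The selection step in a log like the one above boils down to an argmin over measured tactic times; a sparse kernel only wins if its measured time is the lowest. An illustrative sketch with made-up numbers (not TRT internals):

```python
# Illustrative sketch of tactic selection: time every candidate
# (dense and sparse) and keep the fastest. Times are hypothetical.
timings_ms = {
    "dense_cudnn_0x00": 1.126,
    "dense_cudnn_0x01": 0.910,
    "sparse_cask_0x7c": 1.050,  # sparse is eligible but slower here
}
fastest = min(timings_ms, key=timings_ms.get)
print(fastest)  # dense_cudnn_0x01
```

So "eligible for sparse math" means a sparse tactic enters this competition; "picked" means it actually won it.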
Yes, I can see similar outputs in my logs:
[08/05/2024-11:55:04] [V] [TRT] *************** Autotuning format combination: Half(38400,480:2,30,1), Half(38400,480:2,30,1) -> Half(38400,480:2,30,1) ***************
[08/05/2024-11:55:04] [V] [TRT] --------------- Timing Runner: Conv_276 + Add_277 + Relu_278 (CublasConvolution)
[08/05/2024-11:55:04] [V] [TRT] CublasConvolution has no valid tactics for this config, skipping
[08/05/2024-11:55:04] [V] [TRT] --------------- Timing Runner: Conv_276 + Add_277 + Relu_278 (CaskConvolution)
[08/05/2024-11:55:04] [V] [TRT] CaskConvolution has no valid tactics for this config, skipping
[08/05/2024-11:55:04] [V] [TRT] --------------- Timing Runner: Conv_276 + Add_277 + Relu_278 (CaskFlattenConvolution)
[08/05/2024-11:55:04] [V] [TRT] CaskFlattenConvolution has no valid tactics for this config, skipping
[08/05/2024-11:55:04] [V] [TRT] *************** Autotuning format combination: Half(19200,1:4,1200,40), Half(19200,1:4,1200,40) -> Half(19200,1:4,1200,40) ***************
[08/05/2024-11:55:04] [V] [TRT] --------------- Timing Runner: Conv_276 + Add_277 + Relu_278 (CublasConvolution)
[08/05/2024-11:55:04] [V] [TRT] CublasConvolution has no valid tactics for this config, skipping
[08/05/2024-11:55:04] [V] [TRT] --------------- Timing Runner: Conv_276 + Add_277 + Relu_278 (CaskConvolution)
[08/05/2024-11:55:04] [V] [TRT] CaskConvolution has no valid tactics for this config, skipping
[08/05/2024-11:55:04] [V] [TRT] --------------- Timing Runner: Conv_276 + Add_277 + Relu_278 (CaskFlattenConvolution)
[08/05/2024-11:55:04] [V] [TRT] CaskFlattenConvolution has no valid tactics for this config, skipping
[08/05/2024-11:55:04] [V] [TRT] *************** Autotuning format combination: Half(9600,1:8,600,20), Float(76800,480,30,1) -> Float(76800,480,30,1) ***************
[08/05/2024-11:55:04] [V] [TRT] --------------- Timing Runner: Conv_276 + Add_277 + Relu_278 (CublasConvolution)
[08/05/2024-11:55:04] [V] [TRT] CublasConvolution has no valid tactics for this config, skipping
[08/05/2024-11:55:04] [V] [TRT] --------------- Timing Runner: Conv_276 + Add_277 + Relu_278 (CaskConvolution)
[08/05/2024-11:55:04] [V] [TRT] CaskConvolution has no valid tactics for this config, skipping
[08/05/2024-11:55:04] [V] [TRT] --------------- Timing Runner: Conv_276 + Add_277 + Relu_278 (CaskFlattenConvolution)
[08/05/2024-11:55:04] [V] [TRT] CaskFlattenConvolution has no valid tactics for this config, skipping
[08/05/2024-11:55:04] [V] [TRT] *************** Autotuning format combination: Half(9600,1:8,600,20), Half(9600,1:8,600,20) -> Half(9600,1:8,600,20) ***************
[08/05/2024-11:55:04] [V] [TRT] *************** Autotuning format combination: Half(4800,1:16,300,10), Half(4800,1:16,300,10) -> Half(4800,1:16,300,10) ***************
[08/05/2024-11:55:04] [V] [TRT] =============== Computing costs for
[08/05/2024-11:55:04] [V] [TRT] *************** Autotuning format combination: Float(1622016,5632,64,1), Float(1622016,5632,64,1) -> Float(1622016,5632,64,1) ***************
[08/05/2024-11:55:04] [V] [TRT] --------------- Timing Runner: Conv_663 + Add_664 + Relu_665 (CudnnConvolution)
[08/05/2024-11:55:04] [V] [TRT] Tactic: 0x0000000000000000 Time: 1.13622
[08/05/2024-11:55:04] [V] [TRT] Tactic: 0x0000000000000001 Time: 0.488091
[08/05/2024-11:55:04] [V] [TRT] Tactic: 0x0000000000000002 Time: 1.41035
[08/05/2024-11:55:13] [V] [TRT] Tactic: 0x0000000000000004 Time: 898.262
[08/05/2024-11:55:14] [V] [TRT] Tactic: 0x0000000000000005 Time: 2.15659
[08/05/2024-11:55:14] [V] [TRT] Tactic: 0x0000000000000038 Time: 1.14235
[08/05/2024-11:55:14] [V] [TRT] Tactic: 0x0000000000000039 Time: 0.490153
[08/05/2024-11:55:14] [V] [TRT] Tactic: 0x000000000000003a Time: 1.41814
[08/05/2024-11:55:23] [V] [TRT] Tactic: 0x000000000000003c Time: 899.604
[08/05/2024-11:55:23] [V] [TRT] Tactic: 0x000000000000003d Time: 2.13477
[08/05/2024-11:55:23] [V] [TRT] Tactic: 0x0000000000000070 Time: 1.13643
[08/05/2024-11:55:23] [V] [TRT] Tactic: 0x0000000000000071 Time: 1.13343
[08/05/2024-11:55:23] [V] [TRT] Tactic: 0x0000000000000072 Time: 1.40724
[08/05/2024-11:55:32] [V] [TRT] Tactic: 0x0000000000000074 Time: 898.031
[08/05/2024-11:55:33] [V] [TRT] Tactic: 0x0000000000000075 Time: 2.16335
[08/05/2024-11:55:33] [V] [TRT] Fastest Tactic: 0x0000000000000001 Time: 0.488091
[08/05/2024-11:55:33] [V] [TRT] --------------- Timing Runner: Conv_663 + Add_664 + Relu_665 (CublasConvolution)
[08/05/2024-11:55:33] [V] [TRT] CublasConvolution has no valid tactics for this config, skipping
[08/05/2024-11:55:33] [V] [TRT] --------------- Timing Runner: Conv_663 + Add_664 + Relu_665 (CaskGemmConvolution)
[08/05/2024-11:55:33] [V] [TRT] CaskGemmConvolution has no valid tactics for this config, skipping
......
Layer(CaskConvolution): Conv_276 + Add_277 + Relu_278, Tactic: 0x0242190371fcab5b, onnx::Conv_1244 (Half[2,160:16,16,30]), onnx::Conv_1238 (Half[2,160:16,16,30]) -> onnx::Conv_1248 (Half[2,160:16,16,30])
Layer(NoOp): Reformatting CopyNode for Input Tensor 0 to Conv_663 + Add_664 + Relu_665, Tactic: 0x0000000000000000, onnx::Conv_1906 (Half[1,288:16,88,64]) -> Reformatted Input Tensor 0 to Conv_663 + Add_664 + Relu_665 (Half[1,288:8,88,64])
Layer(NoOp): Reformatting CopyNode for Input Tensor 1 to Conv_663 + Add_664 + Relu_665, Tactic: 0x0000000000000000, onnx::Conv_1900 (Half[1,288:16,88,64]) -> Reformatted Input Tensor 1 to Conv_663 + Add_664 + Relu_665 (Half[1,288:8,88,64])
Layer(CaskConvolution): Conv_663 + Add_664 + Relu_665, Tactic: 0xd80cb0f3373aef38, Reformatted Input Tensor 0 to Conv_663 + Add_664 + Relu_665 (Half[1,288:8,88,64]), Reformatted Input Tensor 1 to Conv_663 + Add_664 + Relu_665 (Half[1,288:8,88,64]) -> onnx::Conv_1910 (Half[1,288:8,88,64])
I wonder why it produced these results. Conv_663 + Add_664 + Relu_665 has more channels than Conv_276 + Add_277 + Relu_278, so if Conv_276 + Add_277 + Relu_278 picked a sparse implementation, I would expect Conv_663 + Add_664 + Relu_665 to do so as well. But the results don't match my expectation.
You can run the following command and then upload the .json files:
./trtexec --onnx=sparse.onnx --saveEngine=sparse.trt --sparsity=enable --fp16 --verbose \
--separateProfileRun \
--profilingVerbosity=detailed \
--dumpProfile \
--dumpLayerInfo \
--exportProfile=li_profile.json \
--exportLayerInfo=li_layinfo.json
Due to some limitations I can't upload the complete .json files. I copied all the entries involving Conv_663 + Add_664 + Relu_665 and Conv_276 + Add_277 + Relu_278. li_layinfo.json:
{"Layers": [
......
,{
"Name": "Conv_276 + Add_277 + Relu_278",
"LayerType": "CaskConvolution",
"Inputs": [
{
"Name": "onnx::Conv_1244",
"Location": "Device",
"Dimensions": [2,160,16,30],
"Format/Datatype": "Channel major FP16 format where channel % 16 == 0"
},
{
"Name": "onnx::Conv_1238",
"Location": "Device",
"Dimensions": [2,160,16,30],
"Format/Datatype": "Channel major FP16 format where channel % 16 == 0"
}],
"Outputs": [
{
"Name": "onnx::Conv_1248",
"Location": "Device",
"Dimensions": [2,160,16,30],
"Format/Datatype": "Channel major FP16 format where channel % 16 == 0"
}],
"ParameterType": "Convolution",
"Kernel": [1,1],
"PaddingMode": "kEXPLICIT_ROUND_DOWN",
"PrePadding": [0,0],
"PostPadding": [0,0],
"Stride": [1,1],
"Dilation": [1,1],
"OutMaps": 160,
"Groups": 1,
"Weights": {"Type": "Half", "Count": 25600},
"Bias": {"Type": "Half", "Count": 160},
"HasSparseWeights": 1,
"HasDynamicFilter": 0,
"HasDynamicBias": 0,
"HasResidual": 1,
"ConvXAsActInputIdx": -1,
"BiasAsActInputIdx": -1,
"ResAsActInputIdx": -1,
"Activation": "RELU",
"HasBias": 1,
"HasReLU": 1,
"TacticName": "sm80_xmma_fprop_sparse_conv_f16f16_f16f16_f16_nhwckrsc_nhwc_tilesize64x128x64_stage3_warpsize1x4x1_g1_sptensor16x8x32_t1r1s1",
"TacticValue": "0x7c131d76d207c8d1"
},{
"Name": "Reformatting CopyNode for Input Tensor 0 to Conv_663 + Add_664 + Relu_665",
"LayerType": "NoOp",
"Inputs": [
{
"Name": "onnx::Conv_1906",
"Location": "Device",
"Dimensions": [1,288,88,64],
"Format/Datatype": "Channel major FP16 format where channel % 16 == 0"
}],
"Outputs": [
{
"Name": "Reformatted Input Tensor 0 to Conv_663 + Add_664 + Relu_665",
"Location": "Device",
"Dimensions": [1,288,88,64],
"Format/Datatype": "Channel major FP16 format where channel % 8 == 0"
}],
"TacticValue": "0x0000000000000000"
},{
"Name": "Conv_663 + Add_664 + Relu_665",
"LayerType": "CaskConvolution",
"Inputs": [
{
"Name": "Reformatted Input Tensor 0 to Conv_663 + Add_664 + Relu_665",
"Location": "Device",
"Dimensions": [1,288,88,64],
"Format/Datatype": "Channel major FP16 format where channel % 8 == 0"
},
{
"Name": "onnx::Conv_1900",
"Location": "Device",
"Dimensions": [1,288,88,64],
"Format/Datatype": "Channel major FP16 format where channel % 8 == 0"
}],
"Outputs": [
{
"Name": "onnx::Conv_1910",
"Location": "Device",
"Dimensions": [1,288,88,64],
"Format/Datatype": "Channel major FP16 format where channel % 8 == 0"
}],
"ParameterType": "Convolution",
"Kernel": [1,1],
"PaddingMode": "kEXPLICIT_ROUND_DOWN",
"PrePadding": [0,0],
"PostPadding": [0,0],
"Stride": [1,1],
"Dilation": [1,1],
"OutMaps": 288,
"Groups": 1,
"Weights": {"Type": "Half", "Count": 82944},
"Bias": {"Type": "Half", "Count": 288},
"HasSparseWeights": 1,
"HasDynamicFilter": 0,
"HasDynamicBias": 0,
"HasResidual": 1,
"ConvXAsActInputIdx": -1,
"BiasAsActInputIdx": -1,
"ResAsActInputIdx": -1,
"Activation": "RELU",
"HasBias": 1,
"HasReLU": 1,
"TacticName": "sm80_xmma_fprop_implicit_gemm_f16f16_f16f16_f16_nhwckrsc_nhwc_tilesize256x64x32_stage3_warpsize4x1x1_g1_tensor16x8x16_simple_t1r1s1",
"TacticValue": "0x3eda3b336995a6f0"
},......
li_profile.json:
[
{ "count" : 99 }
......
, { "name" : "Conv_274 + Relu_275", "timeMs" : 2.83002, "averageMs" : 0.028586, "medianMs" : 0.028608, "percentage" : 0.100822 }
, { "name" : "Conv_276 + Add_277 + Relu_278", "timeMs" : 2.05859, "averageMs" : 0.0207939, "medianMs" : 0.020768, "percentage" : 0.0733392 }
, { "name" : "Reformatting CopyNode for Input Tensor 0 to Conv_663 + Add_664 + Relu_665", "timeMs" : 0, "averageMs" : 0, "medianMs" : 0, "percentage" : 0 }
, { "name" : "Conv_663 + Add_664 + Relu_665", "timeMs" : 10.1323, "averageMs" : 0.102347, "medianMs" : 0.102496, "percentage" : 0.360973 }
, { "name" : "Reformatting CopyNode for Input Tensor 0 to Conv_666 + Relu_667", "timeMs" : 0, "averageMs" : 0, "medianMs" : 0, "percentage" : 0 }
......
You can tar/rar the json files.
Sorry, confidentiality requirements prevent me from uploading the entire contents of the json files. I completely understand if this can't be diagnosed from the current information.
In short, a sparse convolution is not necessarily faster than its dense counterpart when the tactics are timed (scored) during engine build.
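One way to see why: the builder compares end-to-end costs per format combination, and a sparse tactic may carry extra costs (for example, a reformat copy into the layout its kernel requires) that erase the raw kernel speedup. A back-of-envelope sketch with purely hypothetical numbers:

```python
# Hypothetical costs: the sparse kernel alone is faster, but once the
# reformat copy it would require is included, the dense path wins.
dense_kernel_ms = 0.102   # dense tactic in the tensor's native layout
sparse_kernel_ms = 0.080  # sparse tactic in its preferred layout
reformat_ms = 0.030       # copy into the layout the sparse kernel needs

sparse_total_ms = sparse_kernel_ms + reformat_ms
picked = "sparse" if sparse_total_ms < dense_kernel_ms else "dense"
print(picked)  # dense
```

This would be consistent with the layer-info dump above: Conv_663 + Add_664 + Relu_665 runs in a channel % 8 layout that needed a reformat copy, and a dense implicit-GEMM tactic won overall.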
why?