TensorRT not supporting structured sparsity for Matrix Multiplication and Linear Layers?
I'm implementing N:M fine-grained sparsity acceleration for ViTs and ConvNets. During the acceleration of ConvNets (e.g. the official example for ResNeXt101-32x8d), the "Layers eligible for sparse math" list still contains the Gemm of the last layer of ResNeXt101. However, when it comes to "TRT inference plan picked sparse implementation for layers:", the Gemm is gone.
[07/19/2022-11:08:35] [I] [TRT] (Sparsity) Layers eligible for sparse math: Conv_3 + Relu_4, Conv_8, Conv_7 + Add_9 + Relu_10, Conv_11 + Relu_12, Conv_15 + Add_16 + Relu_17, Conv_18 + Relu_19, Conv_22 + Add_23 + Relu_24, Conv_25 + Relu_26, Conv_30, Conv_29 + Add_31 + Relu_32, Conv_33 + Relu_34, Conv_37 + Add_38 + Relu_39, Conv_40 + Relu_41, Conv_44 + Add_45 + Relu_46, Conv_47 + Relu_48, Conv_51 + Add_52 + Relu_53, Conv_54 + Relu_55, Conv_59, Conv_58 + Add_60 + Relu_61, Conv_62 + Relu_63, Conv_66 + Add_67 + Relu_68, Conv_69 + Relu_70, Conv_73 + Add_74 + Relu_75, Conv_76 + Relu_77, Conv_80 + Add_81 + Relu_82, Conv_83 + Relu_84, Conv_87 + Add_88 + Relu_89, Conv_90 + Relu_91, Conv_94 + Add_95 + Relu_96, Conv_97 + Relu_98, Conv_101 + Add_102 + Relu_103, Conv_104 + Relu_105, Conv_108 + Add_109 + Relu_110, Conv_111 + Relu_112, Conv_115 + Add_116 + Relu_117, Conv_118 + Relu_119, Conv_122 + Add_123 + Relu_124, Conv_125 + Relu_126, Conv_129 + Add_130 + Relu_131, Conv_132 + Relu_133, Conv_136 + Add_137 + Relu_138, Conv_139 + Relu_140, Conv_143 + Add_144 + Relu_145, Conv_146 + Relu_147, Conv_150 + Add_151 + Relu_152, Conv_153 + Relu_154, Conv_157 + Add_158 + Relu_159, Conv_160 + Relu_161, Conv_164 + Add_165 + Relu_166, Conv_167 + Relu_168, Conv_171 + Add_172 + Relu_173, Conv_174 + Relu_175, Conv_178 + Add_179 + Relu_180, Conv_181 + Relu_182, Conv_185 + Add_186 + Relu_187, Conv_188 + Relu_189, Conv_192 + Add_193 + Relu_194, Conv_195 + Relu_196, Conv_199 + Add_200 + Relu_201, Conv_202 + Relu_203, Conv_206 + Add_207 + Relu_208, Conv_209 + Relu_210, Conv_213 + Add_214 + Relu_215, Conv_216 + Relu_217, Conv_221, Conv_220 + Add_222 + Relu_223, Conv_224 + Relu_225, Conv_228 + Add_229 + Relu_230, Conv_231 + Relu_232, Conv_235 + Add_236 + Relu_237, Gemm_240
[07/19/2022-11:08:35] [I] [TRT] (Sparsity) TRT inference plan picked sparse implementation for layers: Conv_3 + Relu_4, Conv_8, Conv_11 + Relu_12, Conv_18 + Relu_19, Conv_25 + Relu_26, Conv_30, Conv_33 + Relu_34, Conv_40 + Relu_41, Conv_47 + Relu_48, Conv_54 + Relu_55, Conv_59, Conv_58 + Add_60 + Relu_61, Conv_62 + Relu_63, Conv_66 + Add_67 + Relu_68, Conv_69 + Relu_70, Conv_73 + Add_74 + Relu_75, Conv_76 + Relu_77, Conv_80 + Add_81 + Relu_82, Conv_83 + Relu_84, Conv_87 + Add_88 + Relu_89, Conv_90 + Relu_91, Conv_94 + Add_95 + Relu_96, Conv_97 + Relu_98, Conv_101 + Add_102 + Relu_103, Conv_104 + Relu_105, Conv_108 + Add_109 + Relu_110, Conv_111 + Relu_112, Conv_115 + Add_116 + Relu_117, Conv_118 + Relu_119, Conv_122 + Add_123 + Relu_124, Conv_125 + Relu_126, Conv_129 + Add_130 + Relu_131, Conv_132 + Relu_133, Conv_136 + Add_137 + Relu_138, Conv_139 + Relu_140, Conv_143 + Add_144 + Relu_145, Conv_146 + Relu_147, Conv_150 + Add_151 + Relu_152, Conv_153 + Relu_154, Conv_157 + Add_158 + Relu_159, Conv_160 + Relu_161, Conv_164 + Add_165 + Relu_166, Conv_167 + Relu_168, Conv_171 + Add_172 + Relu_173, Conv_174 + Relu_175, Conv_178 + Add_179 + Relu_180, Conv_181 + Relu_182, Conv_185 + Add_186 + Relu_187, Conv_188 + Relu_189, Conv_192 + Add_193 + Relu_194, Conv_195 + Relu_196, Conv_199 + Add_200 + Relu_201, Conv_202 + Relu_203, Conv_206 + Add_207 + Relu_208, Conv_209 + Relu_210, Conv_213 + Add_214 + Relu_215, Conv_216 + Relu_217, Conv_221, Conv_220 + Add_222 + Relu_223, Conv_224 + Relu_225, Conv_228 + Add_229 + Relu_230, Conv_231 + Relu_232, Conv_235 + Add_236 + Relu_237
I then tested "vit_tiny_patch16_224", where the behavior is even more clear-cut:
[07/19/2022-11:21:28] [I] [TRT] (Sparsity) Layers eligible for sparse math: Conv_22
[07/19/2022-11:21:28] [I] [TRT] (Sparsity) TRT inference plan picked sparse implementation for layers:
The only layer eligible for sparse math is Conv_22, the convolution in the patch-embedding layer, and even that one is not picked for a sparse implementation.
The code I used to export the model from PyTorch (timm) to ONNX is:
import torch
import timm
from timm.optim import create_optimizer_v2, optimizer_kwargs
from apex.contrib.sparsity import ASP  # NVIDIA APEX automatic sparsity
model = timm.create_model(args.model, pretrained=True).eval().to(device)
optimizer = create_optimizer_v2(model, **optimizer_kwargs(cfg=args))
dummy_input = torch.randn(args.batch_size, 3, 224, 224).to(device)
ASP.prune_trained_model(model, optimizer)  # apply 2:4 structured pruning masks
torch.onnx.export(model, dummy_input, f"onnx_model/{args.model}_sparse.onnx", verbose=False)
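As a sanity check before exporting, one can verify that the pruned weights really follow the 2:4 pattern along the input dimension. The helper below is my own illustrative sketch (check_2to4 and the model.head.weight example are not part of ASP):

import torch

def check_2to4(weight: torch.Tensor) -> bool:
    # Group every 4 consecutive input elements and count non-zeros per group;
    # a valid 2:4 pattern has at most 2 non-zeros in each group of 4.
    groups = weight.detach().reshape(-1, 4)
    return bool(((groups != 0).sum(dim=1) <= 2).all())

# e.g. check_2to4(model.head.weight) should return True after ASP pruning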
The trtexec command used to build and benchmark the engine from the ONNX model is:
trtexec --onnx=onnx_model/vit_tiny_patch16_224_sparse.onnx --saveEngine=vit_tiny_patch16_224_sparse_engine.trt --explicitBatch --inputIOFormats=fp16:chw --outputIOFormats=fp16:chw --fp16 --sparsity=force
[Update] Running the same model through trtexec with INT8 I/O formats and the --int8 flag:
trtexec --onnx=onnx_model/vit_tiny_patch16_224_sparse.onnx --saveEngine=vit_tiny_patch16_224_pytorch.trt --explicitBatch --inputIOFormats=int8:chw --outputIOFormats=int8:chw --int8 --sparsity=force
gives:
[07/19/2022-11:58:20] [I] [TRT] (Sparsity) Layers eligible for sparse math: Conv_22, MatMul_51, MatMul_69, MatMul_83, MatMul_93, MatMul_107, MatMul_125, MatMul_139, MatMul_149, MatMul_163, MatMul_181, MatMul_195, MatMul_205, MatMul_219, MatMul_237, MatMul_251, MatMul_261, MatMul_275, MatMul_293, MatMul_307, MatMul_317, MatMul_331, MatMul_349, MatMul_363, MatMul_373, MatMul_387, MatMul_405, MatMul_419, MatMul_429, MatMul_443, MatMul_461, MatMul_475, MatMul_485, MatMul_499, MatMul_517, MatMul_531, MatMul_541, MatMul_555, MatMul_573, MatMul_587, MatMul_597, MatMul_611, MatMul_629, MatMul_643, MatMul_653, MatMul_667, MatMul_685, MatMul_699, MatMul_709, Gemm_725
[07/19/2022-11:58:20] [I] [TRT] (Sparsity) TRT inference plan picked sparse implementation for layers:
The MatMul operations are now added to "Layers eligible for sparse math", but still nothing shows up under "TRT inference plan picked sparse implementation for layers". Any idea why MatMul only becomes eligible with INT8, and is there a way to get it picked up by the sparse implementation?
TensorRT didn't pick the sparse implementation because the dense implementation is faster than the sparse one, and TRT always chooses the fastest kernel. A similar thing happens with FP16 and INT8: if FP16 is faster (globally), then TRT will choose FP16 instead of INT8.
If you want to force TRT to use sparse kernels, you can do it via IAlgorithmSelector (https://github.com/NVIDIA/TensorRT/tree/master/samples/sampleAlgorithmSelector).
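For reference, a minimal Python sketch of wiring an IAlgorithmSelector into the builder config is below. It assumes you have already identified which tactic IDs correspond to sparse kernels for the layers you care about (e.g. via report_algorithms or a verbose build); ForceSparseSelector, preferred_tactics, and the tactic value in the usage line are hypothetical placeholders, not TensorRT API:

import tensorrt as trt

class ForceSparseSelector(trt.IAlgorithmSelector):
    def __init__(self, preferred_tactics):
        trt.IAlgorithmSelector.__init__(self)
        # Hypothetical map: layer name -> set of tactic IDs you identified as sparse.
        self.preferred_tactics = preferred_tactics

    def select_algorithms(self, context, choices):
        wanted = self.preferred_tactics.get(context.name)
        if not wanted:
            return list(range(len(choices)))        # let TRT pick freely
        picked = [i for i, c in enumerate(choices)
                  if c.algorithm_variant.tactic in wanted]
        return picked or list(range(len(choices)))  # fall back if nothing matches

    def report_algorithms(self, contexts, choices):
        # Log what the builder finally chose, layer by layer.
        for ctx, algo in zip(contexts, choices):
            print(ctx.name, hex(algo.algorithm_variant.tactic))

# usage (placeholder tactic ID):
# config.algorithm_selector = ForceSparseSelector({"Gemm_725": {0x1234}})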
Closing since there has been no activity for more than 3 weeks. Please reopen if you still have questions, thanks!