🐛 [Bug] Conversion error when using torch-TRT to run the bert model after qat quantization

Open lixiaolx opened this issue 3 years ago • 2 comments

Bug Description

When testing a BERT model after QAT quantization with the latest code, the following error occurs and the model cannot be run. [image: screenshot of the error]

The error corresponds to this code location ( https://github.com/pytorch/TensorRT/blob/master/core/partitioning/shape_analysis.cpp#L93 )

Log analysis shows two things. First, with the latest code the BERT model is partitioned into subgraphs because tuples are now supported ( https://github.com/pytorch/TensorRT/blob/master/core/compiler.cpp#L434 ). Second, the freeze pass is not triggered in QAT (int8) mode ( https://github.com/pytorch/TensorRT/blob/master/core/lowering/lowering.cpp#L90 ), so during conversion and partitioning the weights are also treated as inputs. As a result, the number of subgraph inputs grows after segmentation (3 -> 398), which triggers the error above.
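To illustrate the mechanism described above, here is a minimal toy sketch in plain Python. It is not Torch-TensorRT's partitioning code; the `Node` class, the `segment_inputs` helper, the 4-layer segment, and the input names are all hypothetical. It only shows how the input count of a segment grows when weights are not frozen into the graph as constants.

```python
# Toy model of graph partitioning; NOT Torch-TensorRT's real implementation.
# All names here (Node, segment_inputs, layer names) are hypothetical.
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    output: str    # name of the value this node produces
    inputs: tuple  # names of the values this node consumes


def segment_inputs(segment, constants=frozenset()):
    """Return the values a segment reads but does not produce itself.

    Values listed in `constants` are treated as embedded in the segment
    (as they would be after a freeze pass), so they do not count as inputs.
    """
    produced = {node.output for node in segment}
    needed = []
    for node in segment:
        for name in node.inputs:
            if name in produced or name in constants or name in needed:
                continue
            needed.append(name)
    return needed


# Three real model inputs (assumed names for a BERT-style model).
activations = ["input_ids", "token_type_ids", "attention_mask"]

# Hypothetical segment: each layer reads the previous activation
# plus its own weight and bias.
segment = [Node("h0", tuple(activations))]
weights = []
for i in range(1, 5):
    w, b = f"layer{i}.weight", f"layer{i}.bias"
    weights += [w, b]
    segment.append(Node(f"h{i}", (f"h{i - 1}", w, b)))

# With frozen weights, only the real model inputs cross the boundary.
frozen_inputs = segment_inputs(segment, constants=frozenset(weights))
# Without freezing, every weight and bias also becomes a segment input.
unfrozen_inputs = segment_inputs(segment)

print(len(frozen_inputs), len(unfrozen_inputs))  # prints: 3 11
```

In this toy case the count goes from 3 to 11; in the real BERT graph the same effect would account for the jump from 3 to 398 inputs, assuming the analysis above is correct.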

I also tried rolling the code back to version 1.1. That code, which does not have the tuple feature, can run the QAT BERT model ( https://github.com/pytorch/TensorRT/blob/release/1.1/core/compiler.cpp#L423 ), and the subgraph segmentation process is not triggered.

To Reproduce

Step 1: BERT model download ( https://zenodo.org/record/4792496#.YyBmlhNBxJU )
Step 2: Follow the documentation steps to generate the jit.trace model corresponding to QAT
Step 3: jit.load
Step 4: torch.compile

lixiaolx avatar Sep 13 '22 11:09 lixiaolx

Hi @lixiaolx is the model trained using the PyTorch QAT toolkit?

ncomly-nvidia avatar Sep 19 '22 16:09 ncomly-nvidia

Hi @lixiaolx is the model trained using the PyTorch QAT toolkit?

Yes, this BERT model was trained using the PyTorch QAT tools. I can run it on the version of torch-tensorrt that does not support the tuple feature. With the latest version, graph segmentation is triggered and the above error occurs. I traced the error to the shape analysis part ( https://github.com/pytorch/TensorRT/blob/master/core/partitioning/shape_analysis.cpp#L93 )

lixiaolx avatar Sep 20 '22 02:09 lixiaolx

This issue has not seen activity for 90 days, Remove stale label or comment or this will be closed in 10 days

github-actions[bot] avatar Dec 20 '22 00:12 github-actions[bot]

This issue has not seen activity for 90 days, Remove stale label or comment or this will be closed in 10 days

github-actions[bot] avatar Apr 04 '23 00:04 github-actions[bot]