Implementation bug in Disentangled Attention Plugin
Description
Where the tmp values are precomputed for re-use, tmp is calculated as below:
https://github.com/NVIDIA/TensorRT/blob/78245b0ac2af9a208ed02e5257bfd3ee7ae8a88d/plugin/disentangledAttentionPlugin/disentangledKernel.cu#L122
The sequence length `dimResult.y` is wrongly used as the maximum relative position.
According to the Hugging Face implementation and the Microsoft DeBERTa implementation, it should be:
float tmp = (mid - 1) / (logf(dimData1.z - 1) - tmp1) * (1 - kEPSILON);
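To see why the maximum relative position (and not the sequence length) belongs in the denominator, here is a minimal Python sketch of DeBERTa-style log-bucket relative-position mapping, which is what the precomputed `tmp` scale factor implements. Function and variable names are illustrative, not the plugin's actual identifiers:

```python
import math

def make_log_bucket_position(rel_pos, bucket_size, max_position):
    """Map a signed relative position into a log-spaced bucket.

    Positions with |rel_pos| < bucket_size // 2 are kept as-is; farther
    positions are compressed logarithmically. The scale factor divides by
    log(max_position - 1) - log(mid), i.e. it is derived from the
    *maximum relative position*, not the sequence length.
    """
    mid = bucket_size // 2
    if abs(rel_pos) < mid:
        # Near positions: identity mapping.
        return rel_pos
    sign = 1 if rel_pos > 0 else -1
    # Far positions: log-compress into the remaining (mid - 1) buckets.
    log_pos = math.ceil(
        math.log(abs(rel_pos) / mid)
        / math.log((max_position - 1) / mid)
        * (mid - 1)
    ) + mid
    return sign * log_pos
```

For example, with `bucket_size=256` and `max_position=512`, positions within ±127 map to themselves, while the farthest position 511 lands in the last bucket, 255. Substituting the sequence length for `max_position` changes the compression ratio whenever the two differ, which is the inconsistency with the reference implementations reported above.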
Environment
TensorRT Version: 8.6
NVIDIA GPU: N/A
NVIDIA Driver Version: N/A
CUDA Version: N/A
CUDNN Version: N/A
Operating System: N/A
Python Version (if applicable):
Tensorflow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if so, version):
Relevant Files
Model link: N/A
Steps To Reproduce
Commands or scripts: N/A
Have you tried the latest release?: N/A
Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (`polygraphy run <model.onnx> --onnxrt`): N/A
@samurdhikaru @ttyio ^ ^
@zerollzeng could you help to create internal issue for this? thank you!
Filed internal bug 4424499 to track this.
Thanks @fillmore. I previously fixed this in FT (https://github.com/NVIDIA/FasterTransformer/blob/df4a7534860137e060e18d2ebf019906120ea204/src/fastertransformer/kernels/disentangled_attention_kernels.cu#L268) but forgot to sync it to the TRT plugin.
@symphonylyh , any ETA about the fix in TensorRT?
@fillmore Please expect the fix in the next TRT release. Thanks.
This is fixed in main, closing and thanks all!