Aashraya
Any update on this issue?
@symphonylyh Thank you for the detailed response. In my case, the decoder outputs are way off compared to the HF model's. I have tried optimising with TensorRT as well...
My model is Flan-T5 XL with TP=1. Yes, I am using bfloat16, not fp16.
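For anyone reproducing the comparison: below is a minimal sketch (the prompt, decoding settings, and output length are illustrative assumptions, not from this thread) of producing a deterministic bfloat16 HF reference to diff the TRT-LLM decoder output against.

```python
# Minimal sketch: a deterministic bfloat16 HF reference for flan-t5-xl.
# Prompt and generation settings here are illustrative, not from the thread.
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_name = "google/flan-t5-xl"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(
    model_name, torch_dtype=torch.bfloat16
).eval()

prompt = "translate English to German: The house is wonderful."
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    # Greedy decoding keeps the reference deterministic, so it can be
    # compared token-by-token against the TensorRT-LLM engine output.
    output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

With greedy decoding on both sides, any token-level divergence points at the engine (dtype handling, kernels) rather than sampling noise.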
Thanks @symphonylyh. There is another similar code fragment: [link](https://github.com/NVIDIA/TensorRT-LLM/blob/71d8d4d3dc655671f32535d6d2b60cab87f36e87/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderMaskedMultiheadAttentionTemplate.h#L2095C1-L2098C49). Do we need to change this as well?
Gotcha... tested on some examples; it seems to be working fine now. Will update after exhaustive testing.