FasterTransformer
Possible bug in decoder cross attention
Hi,
If I take the same encoder input and pad it to two different maximum lengths, I get noticeably different encoder memory key/value tensors in decoder cross attention, and for some inputs this produces slightly different result tokens. After reviewing the code, I found the `is_batch_major_cache_` argument, which is `true` by default. When I ran attention with `is_batch_major_cache_ = false`, the key/value and result-token mismatch was gone. My guess is that the default implementation of decoder cross attention does not handle the memory lengths correctly.
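For context, here is a minimal PyTorch sketch (not FasterTransformer code; the layer, shapes, and lengths below are arbitrary) of the property I would expect to hold: as long as the padded positions of the encoder memory are masked out, the cross-attention output for the real positions should be identical no matter how far the memory is padded.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model, n_heads = 64, 4
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

real_len = 5                                  # actual encoder sequence length
query = torch.randn(1, 3, d_model)            # decoder-side queries
memory = torch.randn(1, real_len, d_model)    # encoder output, unpadded

def cross_attend(memory, padded_len):
    """Pad the encoder memory to `padded_len` and mask the padding."""
    pad = torch.zeros(1, padded_len - real_len, d_model)
    padded = torch.cat([memory, pad], dim=1)
    mask = torch.zeros(1, padded_len, dtype=torch.bool)
    mask[:, real_len:] = True                 # True = ignore this position
    out, _ = attn(query, padded, padded, key_padding_mask=mask)
    return out

out_a = cross_attend(memory, padded_len=8)
out_b = cross_attend(memory, padded_len=16)
print(torch.allclose(out_a, out_b, atol=1e-6))  # expected: True
```

This reference check passes regardless of the padded length, whereas with the default `is_batch_major_cache_ = true` the FasterTransformer decoder gives me different cross-attention key/value tensors for the two padded lengths.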
Please provide reproduction steps as described in the bug template. Thank you.
Closing this bug because it is inactive. Feel free to re-open this issue if you still have any problems.