fairseq
Does torch 1.12 improve fairseq's TransformerLayer?
According to this link, torch 1.12.0 improves the inference speed of TransformerEncoder, TransformerEncoderLayer, and MultiheadAttention (MHA) under specific conditions (when the input contains many padding tokens) by fusing CUDA kernels, among other optimizations.
However, fairseq uses its own TransformerLayer. Despite this, is there any improvement in fairseq too? Or is it better to use PyTorch's Transformer layers?
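
For context, here is a minimal sketch of the kind of setup that link describes (assuming torch >= 1.12; the sizes and the padding mask are made up): in eval/inference mode, with batch_first=True and a key padding mask, nn.TransformerEncoder may take the fused fast path.

```python
import torch
import torch.nn as nn

# Minimal sketch (assumes torch >= 1.12): in eval/inference mode with
# batch_first=True and a key padding mask, nn.TransformerEncoder may take
# the fused "fast path". Sizes and the mask below are arbitrary.
layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2).eval()

x = torch.randn(4, 32, 256)                         # (batch, seq, d_model)
padding_mask = torch.zeros(4, 32, dtype=torch.bool)
padding_mask[:, 20:] = True                         # last 12 positions are padding

with torch.inference_mode():
    out = encoder(x, src_key_padding_mask=padding_mask)
print(out.shape)                                    # torch.Size([4, 32, 256])
```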

Not for now, I believe. The new "fast path" feature is taken only when why_not_fast_path is falsy (the empty string '' evaluates to False), in which case torch._native_multi_head_attention, a native C++ implementation, is used. Fairseq uses F.multi_head_attention_forward, which is the method called when why_not_fast_path is truthy (a non-empty string evaluates to True).
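
To make the truthiness check concrete, here is a toy paraphrase of that dispatch. This is not the actual PyTorch source; the two conditions shown are just examples of the many checks PyTorch performs before taking the fast path.

```python
import torch
import torch.nn as nn

def pick_attention_path(mha: nn.MultiheadAttention) -> str:
    # Toy paraphrase of the torch 1.12 dispatch described above -- not the
    # real source. PyTorch builds a string explaining why the fast path
    # cannot be used; an empty string means "no objection".
    why_not_fast_path = ""
    if mha.training:
        why_not_fast_path = "training mode is enabled"
    elif not mha.batch_first:
        why_not_fast_path = "batch_first is False"
    # ... (the real check covers many more conditions)

    if not why_not_fast_path:   # '' is falsy -> fused native (C++) kernel
        return "torch._native_multi_head_attention"
    # non-empty string is truthy -> slow path; this is also the call
    # fairseq's own MultiheadAttention makes directly
    return "F.multi_head_attention_forward"

mha = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True).eval()
print(pick_attention_path(mha))          # torch._native_multi_head_attention
print(pick_attention_path(mha.train()))  # F.multi_head_attention_forward
```

In other words, because fairseq's own attention module goes straight to F.multi_head_attention_forward, it sits on the slow path and does not pick up the 1.12 fused kernel automatically.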