Allow SelfAttention & EncDecAttention in the mesh transformer to use different dimensions for query, key, and value
The paper "Low-Rank Bottleneck in Multi-head Attention Models" suggests fixing the per-head size while keeping the hidden size unchanged, instead of tying the head size to hidden_size / num_heads. Could you support setting `d_q`, `d_k`, and `d_v` independently instead of a single `d_kv`?
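
For illustration only, here is a minimal NumPy sketch (not the mesh-tensorflow API) of multi-head attention where the per-head query/key size and the value size are chosen separately from the hidden size; the names `d_qk` and `d_v` are just placeholders for the proposed independent hyperparameters. Note that the query and key projections must still share a dimension for the dot product, while the value dimension is free.

```python
import numpy as np

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Dot-product attention with head sizes decoupled from the hidden size.

    x:   [seq_len, d_model]
    w_q: [d_model, num_heads * d_qk]   # query projection
    w_k: [d_model, num_heads * d_qk]   # key projection (shares d_qk with queries)
    w_v: [d_model, num_heads * d_v]    # value projection (d_v may differ from d_qk)
    w_o: [num_heads * d_v, d_model]    # output projection back to the hidden size
    """
    seq_len, _ = x.shape
    d_qk = w_q.shape[1] // num_heads
    d_v = w_v.shape[1] // num_heads

    # Project and split into heads: [num_heads, seq_len, head_dim].
    q = (x @ w_q).reshape(seq_len, num_heads, d_qk).transpose(1, 0, 2)
    k = (x @ w_k).reshape(seq_len, num_heads, d_qk).transpose(1, 0, 2)
    v = (x @ w_v).reshape(seq_len, num_heads, d_v).transpose(1, 0, 2)

    # Scaled dot-product attention: queries and keys must share d_qk,
    # but d_v can be set independently of both d_qk and d_model.
    logits = q @ k.transpose(0, 2, 1) / np.sqrt(d_qk)
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    out = weights @ v  # [num_heads, seq_len, d_v]

    # Concatenate heads and project back to d_model.
    out = out.transpose(1, 0, 2).reshape(seq_len, num_heads * d_v)
    return out @ w_o
```

With separate hyperparameters, one could follow the paper's suggestion of fixing `d_qk` (e.g. at the model dimension or another constant) while scaling `num_heads`, without being forced to shrink the per-head size as heads are added.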