Brief question about model structure
We know that for QKV attention, the result of q @ k should be divided by sqrt(d). Will this also be the case for EfficientViT?
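For reference, this is the standard scaled dot-product attention I am referring to (a minimal single-head sketch in PyTorch, not code from this repo; the 1/sqrt(d) scale keeps the logits in a range where softmax does not saturate):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, seq_len, d)
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5  # scale by 1/sqrt(d)
    weights = F.softmax(scores, dim=-1)          # softmax over keys
    return weights @ v
```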
Does ReLU-based linear attention need LayerNorm or positional embeddings?
Does ReLU-based linear attention need multi-head attention? A sketch of my understanding is below.
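To make the questions concrete, here is my understanding of ReLU-based linear attention as a single-head sketch (an assumption on my part, following the usual kernelized linear-attention normalization ReLU(Q)(ReLU(K)^T V) / (ReLU(Q) ReLU(K)^T 1), not necessarily the exact EfficientViT implementation). The softmax is replaced by ReLU feature maps, so attention can be computed in time linear in sequence length; my questions are whether the 1/sqrt(d) scale, LayerNorm / positional embeddings, and multiple heads are still needed around this:

```python
import torch

def relu_linear_attention(q, k, v, eps=1e-6):
    # q, k, v: (batch, seq_len, d)
    q = torch.relu(q)                 # ReLU feature map on queries
    k = torch.relu(k)                 # ReLU feature map on keys
    kv = k.transpose(-2, -1) @ v      # (d, d) context matrix, linear in seq_len
    num = q @ kv                      # numerator: ReLU(Q) (ReLU(K)^T V)
    den = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1)  # normalizer: ReLU(Q) ReLU(K)^T 1
    return num / (den + eps)
```

Note there is no softmax here, which is why I am unsure whether the sqrt(d) scaling from standard attention still applies.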