Possible performance boost with 2-pass online softmax
Per the discussion in https://arxiv.org/abs/1805.02867 (Online normalizer calculation for softmax), I am wondering whether there is still a potential performance gain from a 2-pass online softmax. Flash attention, which is already available in this project, already fuses the softmax using the online normalizer. However, on code paths where the standalone softmax op is still used, there may still be something to gain. Whether it pays off ultimately depends on the model architecture and this project's implementation. I would appreciate help with the analysis.
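
For reference, here is a minimal standalone sketch of what I mean, contrasting the conventional 3-pass "safe" softmax with the 2-pass online variant from the paper. This is illustrative C++ only, not the ggml implementation; function names are made up for the example.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// Conventional "safe" softmax: 3 passes over the input
// (1) find the max, (2) sum the exponentials, (3) normalize.
static void softmax_3pass(const float * x, float * y, size_t n) {
    float m = -INFINITY;
    for (size_t i = 0; i < n; ++i) m = std::max(m, x[i]);        // pass 1

    float d = 0.0f;
    for (size_t i = 0; i < n; ++i) d += std::exp(x[i] - m);      // pass 2

    for (size_t i = 0; i < n; ++i) y[i] = std::exp(x[i] - m) / d; // pass 3
}

// Online softmax (https://arxiv.org/abs/1805.02867): the max and the
// normalizer are computed together in a single pass by rescaling the
// running sum whenever a new maximum appears, so only 2 passes remain.
static void softmax_2pass_online(const float * x, float * y, size_t n) {
    float m = -INFINITY; // running maximum
    float d = 0.0f;      // running normalizer

    for (size_t i = 0; i < n; ++i) {                              // pass 1 (fused)
        const float m_new = std::max(m, x[i]);
        d = d * std::exp(m - m_new) + std::exp(x[i] - m_new);
        m = m_new;
    }

    for (size_t i = 0; i < n; ++i) y[i] = std::exp(x[i] - m) / d; // pass 2
}

int main() {
    const std::vector<float> x = {1.0f, 2.0f, 3.0f, 4.0f};
    std::vector<float> y3(x.size()), y2(x.size());

    softmax_3pass(x.data(), y3.data(), x.size());
    softmax_2pass_online(x.data(), y2.data(), x.size());

    for (size_t i = 0; i < x.size(); ++i) {
        std::printf("%zu: 3-pass %.6f  2-pass %.6f\n", i, y3[i], y2[i]);
    }
    return 0;
}
```

As I understand the paper, the saving is mainly one fewer sweep over the scores (memory traffic), which is why I expect any benefit to show up only where the softmax still runs as a standalone op rather than inside the fused flash-attention kernel.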