
Possible performance boost with 2-pass online softmax

Open zixuanweeei opened this issue 9 months ago • 0 comments

Per the discussion in https://arxiv.org/abs/1805.02867 ("Online normalizer calculation for softmax"), I am wondering whether the 2-pass online softmax could still yield a performance boost in this project. Flash attention, which is already supported here, fuses the softmax using the online normalizer, so the attention path would not benefit. But wherever a standalone softmax op is still used, there may be some gain. Whether it pays off ultimately depends on the model architecture and the implementation details, so I hope someone could help with the analysis.
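For reference, here is a minimal sketch of the 2-pass algorithm from the paper. This is an illustrative standalone version, not the ggml/llama.cpp implementation: the first pass fuses the max and exp-sum reductions (rescaling the partial sum whenever a larger max is found), and the second pass normalizes, versus three passes for the naive safe softmax.

```cpp
#include <cmath>
#include <limits>

// 2-pass online softmax (Milakov & Gibiansky, arXiv:1805.02867).
// Hypothetical standalone sketch for illustration only.
void online_softmax(const float * x, float * y, int n) {
    float m = -std::numeric_limits<float>::infinity(); // running max
    float d = 0.0f;                                    // running exp-sum
    // Pass 1: fused max + sum. When a new max appears, rescale the
    // partial sum by exp(old_max - new_max) to keep it consistent.
    for (int i = 0; i < n; ++i) {
        const float m_new = std::max(m, x[i]);
        d = d * std::exp(m - m_new) + std::exp(x[i] - m_new);
        m = m_new;
    }
    // Pass 2: normalize against the final max and sum.
    for (int i = 0; i < n; ++i) {
        y[i] = std::exp(x[i] - m) / d;
    }
}
```

Compared with the naive 3-pass version, this reads the input one fewer time, which is where the potential win comes from on memory-bound softmax kernels.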

zixuanweeei · May 15 '24 15:05