Possible performance boost with 2-pass online softmax
Per the discussion in https://arxiv.org/abs/1805.02867 (Online normalizer calculation for softmax), I am wondering whether there is still a potential performance gain from a 2-pass online softmax. Flash attention, which is already available in this project, already fuses the softmax using the online normalizer. However, on code paths where the standalone softmax op is still used, there may still be something to gain. Whether it pays off ultimately depends on the model architecture and this project's implementation. I would appreciate help with the analysis.
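
For reference, here is a minimal standalone sketch of what I mean, contrasting the conventional 3-pass "safe" softmax with the 2-pass online variant from the paper. This is illustrative C++ only, not the ggml implementation; function names are made up for the example.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// Conventional "safe" softmax: 3 passes over the input
// (1) find the max, (2) sum the exponentials, (3) normalize.
static void softmax_3pass(const float * x, float * y, size_t n) {
    float m = -INFINITY;
    for (size_t i = 0; i < n; ++i) m = std::max(m, x[i]);        // pass 1

    float d = 0.0f;
    for (size_t i = 0; i < n; ++i) d += std::exp(x[i] - m);      // pass 2

    for (size_t i = 0; i < n; ++i) y[i] = std::exp(x[i] - m) / d; // pass 3
}

// Online softmax (https://arxiv.org/abs/1805.02867): the max and the
// normalizer are computed together in a single pass by rescaling the
// running sum whenever a new maximum appears, so only 2 passes remain.
static void softmax_2pass_online(const float * x, float * y, size_t n) {
    float m = -INFINITY; // running maximum
    float d = 0.0f;      // running normalizer

    for (size_t i = 0; i < n; ++i) {                              // pass 1 (fused)
        const float m_new = std::max(m, x[i]);
        d = d * std::exp(m - m_new) + std::exp(x[i] - m_new);
        m = m_new;
    }

    for (size_t i = 0; i < n; ++i) y[i] = std::exp(x[i] - m) / d; // pass 2
}

int main() {
    const std::vector<float> x = {1.0f, 2.0f, 3.0f, 4.0f};
    std::vector<float> y3(x.size()), y2(x.size());

    softmax_3pass(x.data(), y3.data(), x.size());
    softmax_2pass_online(x.data(), y2.data(), x.size());

    for (size_t i = 0; i < x.size(); ++i) {
        std::printf("%zu: 3-pass %.6f  2-pass %.6f\n", i, y3[i], y2[i]);
    }
    return 0;
}
```

As I understand the paper, the saving is mainly one fewer sweep over the scores (memory traffic), which is why I expect any benefit to show up only where the softmax still runs as a standalone op rather than inside the fused flash-attention kernel.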