
Consider: enable streaming attention as default for Llama models (1-4M context)

Open lessw2020 opened this issue 1 year ago • 0 comments

For the price of 4 additional tokens (the first four, kept as attention sinks), we can enable streaming window attention and support extremely long context lengths (1-4M tokens?).

"we introduce StreamingLLM, an efficient framework that enables LLMs trained with a finite length attention window to generalize to infinite sequence lengths without any fine-tuning. We show that StreamingLLM can enable Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4 million tokens and more. In addition, we discover that adding a placeholder token as a dedicated attention sink during pre-training can further improve streaming deployment. In streaming settings, StreamingLLM outperforms the sliding window recomputation baseline by up to 22.2x speedup."
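The sink-plus-window scheme the paper describes can be sketched as an attention mask: each query attends to the first few "sink" tokens plus a recent sliding window, all causally. A minimal PyTorch sketch (function name and default sizes are my own, not from the paper or torchtitan):

```python
import torch

def streaming_attention_mask(seq_len: int, num_sink: int = 4, window: int = 1024) -> torch.Tensor:
    """Boolean (seq_len, seq_len) mask for StreamingLLM-style attention.

    A query at position q may attend to key position k iff:
      - k <= q (causal), AND
      - k is one of the first `num_sink` tokens (attention sink), OR
      - k falls within the last `window` positions before q (sliding window).
    """
    q = torch.arange(seq_len).unsqueeze(1)  # query positions, column vector
    k = torch.arange(seq_len).unsqueeze(0)  # key positions, row vector
    causal = k <= q
    sink = k < num_sink
    recent = (q - k) < window
    return causal & (sink | recent)

# Example: 8 tokens, 2 sink tokens, window of 3.
mask = streaming_attention_mask(8, num_sink=2, window=3)
```

In a real KV-cache implementation one would evict entries outside the sink and window rather than masking, so memory stays bounded regardless of sequence length; the mask form above just illustrates which positions remain visible.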

Mixtral uses the sliding window approach. This might be an easy add to showcase the newest attention technique, though it's not a core aspect for PTD. See: https://arxiv.org/abs/2309.17453. I can make a PR to enable this if there is interest.

lessw2020 avatar Feb 25 '24 05:02 lessw2020