InfLLM
Implementation of Streaming-llm
Thanks for the great work! I noticed that the implementation of StreamingLLM fixes the positions of the n_init tokens (https://github.com/thunlp/InfLLM/blob/main/inf_llm/attention/stream_llm.py#L69), while the original StreamingLLM paper says the n_init tokens use different positions. Does the implementation have a problem?
Hi, section 3.2 of the paper says "StreamingLLM focuses on positions within the cache rather than those in the original text". We implement this by placing, for every query token, the init tokens within the first n_init positions of the KV-cache window that query sees.
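To illustrate the idea (not the repo's actual code), here is a minimal sketch of assigning position ids within the cache rather than from the original text; the function name `cache_position_ids` and the parameters `num_init` / `num_local` are hypothetical:

```python
import torch

def cache_position_ids(num_init: int, num_local: int) -> torch.Tensor:
    """Sketch: positions are counted within the cache, not the original text.
    Init tokens always occupy cache positions 0..num_init-1, and the retained
    local-window tokens follow at num_init..num_init+num_local-1, regardless
    of how far back in the original sequence they actually occurred."""
    init_pos = torch.arange(num_init)                          # fixed slots for init tokens
    local_pos = torch.arange(num_init, num_init + num_local)   # contiguous slots for the local window
    return torch.cat([init_pos, local_pos])

# Example: with 4 init tokens and a local window of 8, every query attends to
# keys at cache positions 0..11, no matter what their absolute text positions were.
print(cache_position_ids(4, 8))  # tensor([ 0,  1,  2, ..., 11])
```

So the init tokens keeping a fixed position is consistent with the paper's "positions within the cache" formulation: only the distance inside the cache matters, not the absolute position in the text.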