3 comments by felixzhu555

Yep, trying to implement the logic from that paper. Their repo is [https://github.com/mit-han-lab/streaming-llm](https://github.com/mit-han-lab/streaming-llm).

Based on that timing breakdown, can you try replacing `mask[allowed_tokens] = 0` with torch's `index_fill_`? e.g. `mask.index_fill_(0, allowed_tokens, 0)` (where `allowed_tokens` is a LongTensor). This might be faster than manually indexing the mask.
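
For reference, a minimal sketch of both variants, assuming a 1-D logits mask and a `LongTensor` of allowed token ids (the size and names here are illustrative, not taken from the PR):

```python
import torch

vocab_size = 32000  # illustrative vocab size
# Mask that bans every token (-inf) except the allowed ones.
mask = torch.full((vocab_size,), float("-inf"))
allowed_tokens = torch.tensor([5, 42, 1337], dtype=torch.long)

# Original approach: advanced (fancy) indexing.
# mask[allowed_tokens] = 0

# Suggested alternative: in-place index_fill_ along dim 0.
mask.index_fill_(0, allowed_tokens, 0)
```

Whether `index_fill_` is actually faster would need to be confirmed against the same timing breakdown.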

hi @hustxiayang, sorry, this PR likely won't get merged; it remains an experimental prototype based on an older version of vLLM. After the ongoing engine refactor is complete, the memory...