3 comments by felixzhu555

Yep, trying to implement the logic from that paper. Their repo is [https://github.com/mit-han-lab/streaming-llm](https://github.com/mit-han-lab/streaming-llm).

Based on that timing breakdown, can you try replacing `mask[allowed_tokens] = 0` with torch's `index_fill_`? e.g. `mask.index_fill_(0, allowed_tokens, 0)` (where `allowed_tokens` is a LongTensor). This might be faster than manually indexing the mask.
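
For reference, a minimal sketch of both variants, assuming a 1-D logits mask and a `LongTensor` of allowed token ids (the size and names here are illustrative, not taken from the PR):

```python
import torch

vocab_size = 32000  # illustrative vocab size
# Mask that bans every token (-inf) except the allowed ones.
mask = torch.full((vocab_size,), float("-inf"))
allowed_tokens = torch.tensor([5, 42, 1337], dtype=torch.long)

# Original approach: advanced (fancy) indexing.
# mask[allowed_tokens] = 0

# Suggested alternative: in-place index_fill_ along dim 0.
mask.index_fill_(0, allowed_tokens, 0)
```

Whether `index_fill_` is actually faster would need to be confirmed against the same timing breakdown.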

hi @hustxiayang, sorry, this PR likely won't get merged; it remains an experimental prototype based on an older version of vLLM. After the ongoing engine refactor is complete, the memory...