Pengle Zhang

8 comments by Pengle Zhang

You can use [patch_hf](https://github.com/thunlp/InfLLM/blob/main/inf_llm/utils/patch.py#L33) for transformers. For this usage, you can refer to the [integration](https://github.com/thunlp/InfLLM/blob/main/inf_llm/chat.py#L347) in chat.py. Load the configuration as a dict and pass it to `patch_hf` with your...
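A minimal sketch of that flow, assuming the YAML configs under `config/` and that `patch_hf` takes the attention type plus the remaining config entries as keyword arguments (check `patch.py`/`chat.py` for the actual signature):

```python
# Sketch only: the config path, model name, and the exact keyword arguments
# accepted by patch_hf are assumptions; mirror config/*.yaml and
# inf_llm/utils/patch.py for the real ones.
import yaml
import torch
from transformers import AutoModelForCausalLM
from inf_llm.utils import patch_hf  # import as chat.py does; adjust if the path differs

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",   # illustrative model
    torch_dtype=torch.bfloat16,
).cuda()

# Load one of the provided YAML configs as a plain dict ...
with open("config/mistral-inf-llm.yaml") as f:   # illustrative path
    conf = yaml.safe_load(f)["model"]

# ... and forward its fields to patch_hf: the attention type first,
# the remaining entries as keyword arguments.
attn_type = conf.pop("type")
conf.pop("path", None)   # drop keys the patch itself does not need (assumption)
model = patch_hf(model, attn_type, **conf)
```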

It seems that [Triton 2.2.0 does not support V100](https://github.com/pytorch/pytorch/issues/117146). Try `pip install -U --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/Triton-Nightly/pypi/simple/ triton-nightly==2.1.0.dev20231014192330`. And change the `torch_dtype` to `torch.half`.
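For the dtype change, a minimal sketch (the model name is a placeholder):

```python
# V100 has no bfloat16 support, so load the weights in fp16 instead.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",   # placeholder; use your model
    torch_dtype=torch.half,
).cuda()
```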

Your understanding is correct: basically, any model that uses RoPE can currently work with InfLLM. We don't have much time to maintain this repository; it mainly serves to reproduce the paper's results. If you need to adapt other open-source models, you can follow the implementation in [patch.py](https://github.com/thunlp/InfLLM/blob/main/inf_llm/utils/patch.py) and add an attention-forward replacement for those models.
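Very roughly, the pattern looks like the sketch below; `inf_llm_attn_forward` is a placeholder name (the real replacement forwards live in `inf_llm/utils/patch.py` and `inf_llm/attention/`), and an actual adaptation would compute q/k/v the way the original attention does and hand them to InfLLM's context manager:

```python
# Placeholder sketch of the monkey-patching pattern only; not the repo's code.
import types
from transformers import AutoModelForCausalLM

def inf_llm_attn_forward(self, hidden_states, **kwargs):
    # A real replacement would build q/k/v as the original forward does and
    # route them through InfLLM; here we simply fall back to the stock
    # forward so the sketch stays runnable.
    return self._orig_forward(hidden_states, **kwargs)

model = AutoModelForCausalLM.from_pretrained("your/rope-model")   # placeholder
for module in model.modules():
    if module.__class__.__name__.endswith("Attention"):
        module._orig_forward = module.forward
        module.forward = types.MethodType(inf_llm_attn_forward, module)
```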

It looks like we made a mistake in the configuration. Change `max_cached_block` to something larger than 64.
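For example, if you load the YAML config as a dict rather than editing the file directly, bumping the value looks like this (the key name matches the repo's config files; 128 is just an illustration):

```python
import yaml

with open("config/mistral-inf-llm.yaml") as f:   # illustrative path
    conf = yaml.safe_load(f)
conf["model"]["max_cached_block"] = 128   # anything larger than 64
```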

Hi, it says "StreamingLLM focuses on positions within the cache rather than those in the original text" in section 3.2. And we implement this by, for all query tokens, placing...

Hi, you can add `repeat_kv` from `inf_llm/attention/utils.py` before the qk computation.
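A minimal sketch, assuming `repeat_kv` here follows the same shape convention as the HuggingFace helper of the same name (`[batch, num_kv_heads, seq, head_dim]` plus a repeat count):

```python
import torch
from inf_llm.attention.utils import repeat_kv  # assumed HF-style signature

b, num_heads, num_kv_heads, seq, dim = 1, 32, 8, 16, 128
q = torch.randn(b, num_heads, seq, dim)
k = torch.randn(b, num_kv_heads, seq, dim)

# Expand the KV heads to match the query heads before the qk product.
k = repeat_kv(k, num_heads // num_kv_heads)        # -> [b, num_heads, seq, dim]
scores = torch.matmul(q, k.transpose(-1, -2)) / dim ** 0.5
```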

1. The order does not affect the result; the computation is equivalent.
2. `global_h_q` has already been [rotated](https://github.com/thunlp/InfLLM/blob/main/inf_llm/attention/context_manager.py#L745); leaving `global_h_k` without RoPE is equivalent to rotating it by 0 degrees.

If you only need code that corresponds to the paper's algorithm, the [initial](https://github.com/thunlp/InfLLM/blob/init/inf_llm/attention/context_manager.py) version is easier to read; the current version has been optimized for performance.
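To spell out point 2 with the standard RoPE identity (not code from the repo), writing `R_m` for the rotation at position m:

```latex
% Leaving k unrotated is the same as placing it at position 0, so the score
% depends only on the query's assigned position m.
(R_m q)^\top (R_n k) = q^\top R_{n-m} k
\quad\Longrightarrow\quad
(R_m q)^\top k = (R_m q)^\top (R_0 k) = q^\top R_{-m} k
```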

> Thanks! Have you run any experiments where `global_h_k` is also rotated?

Not at the moment. Following [rerope](https://www.spaces.ac.cn/archives/9708), length extrapolation should use the same rotation angle, but you can try other rotation approaches.