Pengle Zhang

8 comments by Pengle Zhang

You can use [patch_hf](https://github.com/thunlp/InfLLM/blob/main/inf_llm/utils/patch.py#L33) for transformers. For this usage, you can refer to the [integration](https://github.com/thunlp/InfLLM/blob/main/inf_llm/chat.py#L347) in chat.py. Load the configuration as a dict and pass it to `patch_hf` with your...
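A minimal sketch of that flow, assuming the YAML configs under `config/` and that `patch_hf` takes the attention type plus the remaining config entries as keyword arguments (check `patch.py`/`chat.py` for the actual signature):

```python
# Sketch only: the config path, model name, and the exact keyword arguments
# accepted by patch_hf are assumptions; mirror config/*.yaml and
# inf_llm/utils/patch.py for the real ones.
import yaml
import torch
from transformers import AutoModelForCausalLM
from inf_llm.utils import patch_hf  # import as chat.py does; adjust if the path differs

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",   # illustrative model
    torch_dtype=torch.bfloat16,
).cuda()

# Load one of the provided YAML configs as a plain dict ...
with open("config/mistral-inf-llm.yaml") as f:   # illustrative path
    conf = yaml.safe_load(f)["model"]

# ... and forward its fields to patch_hf: the attention type first,
# the remaining entries as keyword arguments.
attn_type = conf.pop("type")
conf.pop("path", None)   # drop keys the patch itself does not need (assumption)
model = patch_hf(model, attn_type, **conf)
```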

It seems that [Triton 2.2.0 does not support V100](https://github.com/pytorch/pytorch/issues/117146). Try `pip install -U --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/Triton-Nightly/pypi/simple/ triton-nightly==2.1.0.dev20231014192330`. And change the `torch_dtype` to `torch.half`.
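For the dtype change, a minimal sketch (the model name is a placeholder):

```python
# V100 has no bfloat16 support, so load the weights in fp16 instead.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",   # placeholder; use your model
    torch_dtype=torch.half,
).cuda()
```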

Your understanding is correct: basically, any model that uses RoPE can currently work with InfLLM. We don't have much time to maintain this repository; it mainly serves to reproduce the paper's results. If you need to adapt other open-source models, you can follow the implementation in [patch.py](https://github.com/thunlp/InfLLM/blob/main/inf_llm/utils/patch.py) and add an attention-forward replacement for those models.
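Very roughly, the pattern looks like the sketch below; `inf_llm_attn_forward` is a placeholder name (the real replacement forwards live in `inf_llm/utils/patch.py` and `inf_llm/attention/`), and an actual adaptation would compute q/k/v the way the original attention does and hand them to InfLLM's context manager:

```python
# Placeholder sketch of the monkey-patching pattern only; not the repo's code.
import types
from transformers import AutoModelForCausalLM

def inf_llm_attn_forward(self, hidden_states, **kwargs):
    # A real replacement would build q/k/v as the original forward does and
    # route them through InfLLM; here we simply fall back to the stock
    # forward so the sketch stays runnable.
    return self._orig_forward(hidden_states, **kwargs)

model = AutoModelForCausalLM.from_pretrained("your/rope-model")   # placeholder
for module in model.modules():
    if module.__class__.__name__.endswith("Attention"):
        module._orig_forward = module.forward
        module.forward = types.MethodType(inf_llm_attn_forward, module)
```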

It looks like we made a mistake in the configuration. Change `max_cached_block` to something larger than 64.
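For example, if you load the YAML config as a dict rather than editing the file directly, bumping the value looks like this (the key name matches the repo's config files; 128 is just an illustration):

```python
import yaml

with open("config/mistral-inf-llm.yaml") as f:   # illustrative path
    conf = yaml.safe_load(f)
conf["model"]["max_cached_block"] = 128   # anything larger than 64
```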

Hi, it says "StreamingLLM focuses on positions within the cache rather than those in the original text" in section 3.2. And we implement this by, for all query tokens, placing...

Hi, you can add `repeat_kv` from `inf_llm/attention/utils.py` before the qk computation.
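A minimal sketch, assuming `repeat_kv` here follows the same shape convention as the HuggingFace helper of the same name (`[batch, num_kv_heads, seq, head_dim]` plus a repeat count):

```python
import torch
from inf_llm.attention.utils import repeat_kv  # assumed HF-style signature

b, num_heads, num_kv_heads, seq, dim = 1, 32, 8, 16, 128
q = torch.randn(b, num_heads, seq, dim)
k = torch.randn(b, num_kv_heads, seq, dim)

# Expand the KV heads to match the query heads before the qk product.
k = repeat_kv(k, num_heads // num_kv_heads)        # -> [b, num_heads, seq, dim]
scores = torch.matmul(q, k.transpose(-1, -2)) / dim ** 0.5
```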

1. The order does not affect the result; the computation is equivalent.
2. `global_h_q` has already been [rotated](https://github.com/thunlp/InfLLM/blob/main/inf_llm/attention/context_manager.py#L745); leaving `global_h_k` without RoPE is equivalent to rotating it by 0 degrees.

If you only need code that corresponds to the paper's algorithm, the [initial](https://github.com/thunlp/InfLLM/blob/init/inf_llm/attention/context_manager.py) version is easier to read; the current version has been optimized for performance.
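To spell out point 2 with the standard RoPE identity (not code from the repo), writing `R_m` for the rotation at position m:

```latex
% Leaving k unrotated is the same as placing it at position 0, so the score
% depends only on the query's assigned position m.
(R_m q)^\top (R_n k) = q^\top R_{n-m} k
\quad\Longrightarrow\quad
(R_m q)^\top k = (R_m q)^\top (R_0 k) = q^\top R_{-m} k
```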

> Thanks! Have you run any experiments where `global_h_k` is also rotated?

Not at the moment. Following [rerope](https://www.spaces.ac.cn/archives/9708), length extrapolation should use the same rotation angle, but you can try other rotation approaches.