stack-heap-overflow

4 comments by stack-heap-overflow

Thank you for your interest in our work. We are aware of the challenges in implementing KV compression on current open-source code and are actively working on it. The HuggingFace's...

The `accelerate` library used by the HuggingFace code miscalculates the GPU memory allocation for the model. The example code has now been updated, which should greatly reduce model loading time. Change the model-loading code to:

```python
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="sequential",
    torch_dtype=torch.bfloat16,
    max_memory=max_memory,
    attn_implementation="eager",
)
```
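The snippet above passes a `max_memory` variable that is not defined in the comment. As a minimal sketch (the helper name and the size strings are hypothetical, adjust them to your hardware), `accelerate` expects a mapping from GPU index (and optionally `"cpu"`) to a maximum size string:

```python
# Hypothetical helper: build a max_memory mapping for device_map="sequential".
# Keys are GPU indices (ints) plus the string "cpu"; values are size strings
# in the format accepted by accelerate, e.g. "20GiB".
def build_max_memory(num_gpus, per_gpu="20GiB", cpu="64GiB"):
    mem = {i: per_gpu for i in range(num_gpus)}
    mem["cpu"] = cpu
    return mem

# Example: two GPUs capped at 20GiB each, CPU offload capped at 64GiB.
max_memory = build_max_memory(2)
```

In practice you would pick `num_gpus` from `torch.cuda.device_count()` and leave a few GiB of headroom per GPU for activations.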

It may be a compatibility issue with the kernels vLLM uses. You can try launching the API in eager mode to see whether the same problem occurs (the demo in the README also runs in eager mode): add `--enforce-eager` to the command-line arguments.
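For illustration, assuming the OpenAI-compatible server entrypoint and a placeholder model path (both are assumptions, not taken from the comment), the flag is added like this:

```shell
# Launch the vLLM API server in eager mode (disables CUDA graph capture),
# which can sidestep kernel-compatibility problems on some setups.
python -m vllm.entrypoints.openai.api_server \
    --model /path/to/model \
    --trust-remote-code \
    --enforce-eager
```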

The required Python libraries are `torch`, `transformers`, and `accelerate`. Compatibility across different versions of these libraries has not been tested in detail; for reference, here is the environment I used for testing (strict matching should not be necessary):

- `torch == 2.1.0`
- `transformers == 4.39.3`
- `accelerate == 0.29.3`
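To compare a local environment against the versions listed above, a small sketch using the standard library (the function name and the exact pins are illustrative, not part of the project):

```python
from importlib.metadata import version, PackageNotFoundError

# Reference versions from the comment above; other versions may also work.
TESTED = {"torch": "2.1.0", "transformers": "4.39.3", "accelerate": "0.29.3"}

def check_versions(tested=TESTED):
    """Return {package: (installed_version_or_None, tested_version)}."""
    report = {}
    for pkg, want in tested.items():
        try:
            report[pkg] = (version(pkg), want)
        except PackageNotFoundError:
            report[pkg] = (None, want)  # package is not installed
    return report
```

Entries whose first element is `None` are missing packages; mismatched pairs flag where your environment diverges from the tested one.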