stack-heap-overflow

4 comments by stack-heap-overflow

Thank you for your interest in our work. We are aware of the challenges in implementing KV compression on current open-source code and are actively working on it. The HuggingFace's...

The `accelerate` library used by the HuggingFace code miscalculates the GPU memory allocation for the model. The example code has now been updated, which should greatly reduce model loading time. Change the model-loading code to:

```python
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="sequential",
    torch_dtype=torch.bfloat16,
    max_memory=max_memory,
    attn_implementation="eager",
)
```
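The snippet above passes a `max_memory` variable that is not defined in the comment. As a minimal sketch (the helper name and the size strings are hypothetical, adjust them to your hardware), `accelerate` expects a mapping from GPU index (and optionally `"cpu"`) to a maximum size string:

```python
# Hypothetical helper: build a max_memory mapping for device_map="sequential".
# Keys are GPU indices (ints) plus the string "cpu"; values are size strings
# in the format accepted by accelerate, e.g. "20GiB".
def build_max_memory(num_gpus, per_gpu="20GiB", cpu="64GiB"):
    mem = {i: per_gpu for i in range(num_gpus)}
    mem["cpu"] = cpu
    return mem

# Example: two GPUs capped at 20GiB each, CPU offload capped at 64GiB.
max_memory = build_max_memory(2)
```

In practice you would pick `num_gpus` from `torch.cuda.device_count()` and leave a few GiB of headroom per GPU for activations.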

It may be a compatibility issue with the kernels vLLM uses. You can try launching the API in eager mode to see whether the same problem occurs (the demo in the README also runs in eager mode): add `--enforce-eager` to the command-line arguments.
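For illustration, assuming the OpenAI-compatible server entrypoint and a placeholder model path (both are assumptions, not taken from the comment), the flag is added like this:

```shell
# Launch the vLLM API server in eager mode (disables CUDA graph capture),
# which can sidestep kernel-compatibility problems on some setups.
python -m vllm.entrypoints.openai.api_server \
    --model /path/to/model \
    --trust-remote-code \
    --enforce-eager
```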

The required Python libraries are `torch`, `transformers`, and `accelerate`. Compatibility across different versions of these libraries has not been tested in detail; for reference, here is the environment I used for testing (strict matching should not be necessary):

- `torch == 2.1.0`
- `transformers == 4.39.3`
- `accelerate == 0.29.3`
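To compare a local environment against the versions listed above, a small sketch using the standard library (the function name and the exact pins are illustrative, not part of the project):

```python
from importlib.metadata import version, PackageNotFoundError

# Reference versions from the comment above; other versions may also work.
TESTED = {"torch": "2.1.0", "transformers": "4.39.3", "accelerate": "0.29.3"}

def check_versions(tested=TESTED):
    """Return {package: (installed_version_or_None, tested_version)}."""
    report = {}
    for pkg, want in tested.items():
        try:
            report[pkg] = (version(pkg), want)
        except PackageNotFoundError:
            report[pkg] = (None, want)  # package is not installed
    return report
```

Entries whose first element is `None` are missing packages; mismatched pairs flag where your environment diverges from the tested one.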