hediyuan comments

Results 6 comments of


hediyuan

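qwen model replies are missing characters

Also, I tried it out: when I ask code-review-style questions and the input contains a lot of English text, qwen still produces repeated output. The first few sentences are fine, but after that it just keeps repeating the last sentence. Long inputs seem to trigger the same repetition. Could someone please help explain this? ^_^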

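> 1. Currently, when free GPU memory is > 1GB, fastllm's memory-release step during Q&A performs the release automatically. (commit [8286d1d](https://github.com/ztxz16/fastllm/commit/8286d1dfca93cd92cfaf5523046ecdab6714b235))
> 2. By pybind11's design, after `model = pyfastllm.create_llm(model_path)` is executed, `model` is managed by Python; you can handle it yourself, and it is released when the interpreter exits.

Thanks for the answer. My problem now is that when my service receives a very long question, inference may run out of GPU memory. Once that happens, fastllm no longer releases the memory and every subsequent request fails. So I would like to know whether there is a way to release GPU memory actively: I want to catch the exception when the error occurs and release the memory myself, so that later requests are not affected.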
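A minimal sketch of the "catch the error, then release" pattern described above, in plain Python. It rests on assumptions the thread does not confirm: that the out-of-memory failure surfaces as a Python `RuntimeError`, and that dropping the last Python reference to the pybind11 object actually returns its GPU memory. `ReloadingModel`, `loader`, and `fn` are hypothetical names; a real `loader` might wrap `pyfastllm.create_llm(model_path)`.

```python
import gc


class ReloadingModel:
    """Holds one model object and rebuilds it after a failed request."""

    def __init__(self, loader):
        # loader is a zero-argument callable that returns a fresh model
        # (hypothetical example: lambda: pyfastllm.create_llm(model_path)).
        self._loader = loader
        self._model = loader()

    def reload(self):
        # Drop the only reference held here, then collect, so the old
        # object can be destroyed before a new one is created.
        self._model = None
        gc.collect()
        self._model = self._loader()

    def run(self, fn, *args):
        try:
            return fn(self._model, *args)
        except RuntimeError as exc:  # assumption: OOM is raised as RuntimeError
            message = str(exc)
        # Reload outside the except block: while the exception is alive,
        # its traceback frames still reference the old model.
        self.reload()
        raise RuntimeError(f"request failed, model reloaded: {message}")
```

The caller still sees an error for the failed request, but the next call to `run` gets a freshly loaded model. If destroying the object does not free the memory in practice, this pattern alone will not help.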

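Is there an interface for releasing the model?

Same request here. After the model has been loaded once in a process, I want to release it actively, but I don't know how to do that. The model just stays loaded in GPU memory.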
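One library-independent workaround, sketched here under assumptions, is to do the loading and inference in a child process: when that process exits, the operating system and driver reclaim everything it allocated, GPU memory included. The worker body below is a stand-in that only upper-cases the prompt; `WORKER` and `ask_in_subprocess` are hypothetical names, and the real model-loading call would go where the comment indicates.

```python
import json
import subprocess
import sys

# Source of the child process. A real worker would load the model here
# (hypothetical: model = pyfastllm.create_llm(model_path)) and run inference.
WORKER = r"""
import json, sys
prompt = json.load(sys.stdin)["prompt"]
reply = prompt.upper()  # stand-in for real inference
json.dump({"reply": reply}, sys.stdout)
"""


def ask_in_subprocess(prompt, timeout=60):
    """Run one request in a fresh interpreter; its memory is freed on exit."""
    proc = subprocess.run(
        [sys.executable, "-c", WORKER],
        input=json.dumps({"prompt": prompt}),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError if the worker crashes
    )
    return json.loads(proc.stdout)["reply"]
```

The trade-off is that the model is reloaded on every call, so this only makes sense when releases are rare or when a long-lived worker process is restarted on demand.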

[Installation]: Could not find a version that satisfies the requirement xgrammar>=0.1.6; platform_machine == "x86_64" (from vllm) (from versions: none)

+1. I tried to build xgrammar from source instead of using the wheel, but it requires newer versions of CMake and GCC.

[Bug]: RuntimeError: Error in model execution: CUDA error: an illegal memory access was encountered

+1 same bug

[Bug]: `size_k must divisible by BLOCK_SIZE_K` error when using tensor parallelism with AWQ-quantized MoE models

> For errors caused by excessive tensor parallelism, you can set --enable-expert-parallel. Refer to: [#17327](https://github.com/vllm-project/vllm/issues/17327)

It works! Thank you!
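For anyone wondering why a higher tensor-parallel degree can trigger this error, here is a rough sketch of the arithmetic. It assumes that tensor parallelism splits the K dimension evenly across ranks and that the kernel needs each rank's slice to be a multiple of `BLOCK_SIZE_K`; the numbers are hypothetical and not taken from any particular model or from the vLLM source.

```python
def shard_is_aligned(size_k, tp_size, block_size_k):
    """True if each rank's slice of size_k is a multiple of block_size_k."""
    if size_k % tp_size != 0:
        return False  # the dimension cannot even be split evenly
    return (size_k // tp_size) % block_size_k == 0


def aligned_tp_sizes(size_k, block_size_k, candidates=(1, 2, 4, 8)):
    """Tensor-parallel degrees from candidates that keep every shard aligned."""
    return [tp for tp in candidates if shard_is_aligned(size_k, tp, block_size_k)]


# Hypothetical sizes: K = 1408 with 128-wide blocks.
# 1408 / 1 = 1408 = 11 * 128 is aligned; 1408 / 2 = 704 is not.
print(aligned_tp_sizes(1408, 128))
```

Under these assumptions, expert parallelism sidesteps the problem because whole experts are placed on different ranks instead of slicing each expert's weights.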

hediyuan

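qwen model replies are missing characters

Does pyfastllm have an interface for releasing GPU memory?

Is there an interface for releasing the model?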

[Installation]: Could not find a version that satisfies the requirement xgrammar>=0.1.6; platform_machine == "x86_64" (from vllm) (from versions: none)

[Bug]: RuntimeError: Error in model execution: CUDA error: an illegal memory access was encountered

[Bug]: `size_k must divisible by BLOCK_SIZE_K` error when using tensor parallelism with AWQ-quantized MoE models