The rendering of `<think>\n\n</think>\n\n` in the chat_template seems to prevent the ReasoningParser from detecting the `</think>` reasoning end token, causing it to mistakenly remain in the reasoning stage. The current DeepSeekR1ReasoningParser appears...
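For reference, here is a minimal sketch (not the actual vLLM parser code) of the kind of end-token check being described; the `</think>` marker and the split logic are assumptions based on the behaviour above:

```
def split_reasoning(model_output: str, end_token: str = "</think>"):
    """Simplified sketch of an end-token based reasoning parser."""
    if end_token not in model_output:
        # No end token seen: the whole output is treated as reasoning,
        # so the parser never leaves the reasoning stage.
        return model_output, ""  # (reasoning_content, content)
    reasoning, _, content = model_output.partition(end_token)
    return reasoning, content
```

If the chat template has already emitted the empty think block in the prompt, the generation itself contains no `</think>`, so everything lands in reasoning_content and content stays empty.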
Having noticed PR https://github.com/vllm-project/vllm/pull/17369, I'd like to point out that while the Qwen3ReasoningParser can already handle most cases, there is still one scenario it doesn't resolve. Consider this situation: When...
> Calling qwen3-32b directly on the Alibaba Bailian platform, the `"chat_template_kwargs": {"enable_thinking": false}` parameter also behaves incorrectly, and differently from vLLM: when calling the platform directly, the parameter simply has no effect, and the workaround above can be used there as well. To stress again, these are all temporary workarounds. (Originally written in Chinese, since anyone using Alibaba Bailian can presumably read it.)

Unofficial, personal guess: when `"chat_template_kwargs": {"enable_thinking": false}` takes effect, the output contains no `<think>` content at all, and the deepseek_r1 parser then mistakenly identifies everything as reasoning_content. When using the /nothink mode, the output contains an empty block like `<think>\n\n</think>`, so the content is not mistakenly identified as reasoning_content.

The logic can be found in `tokenizer_config.json`:

```
{%- if add_generation_prompt %}
{{- '<|im_start|>assistant\n' }}
{%- if enable_thinking is defined and enable_thinking...
```
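To make that guess concrete, a small illustration follows; the literal strings are assumptions based on the template excerpt above, not captured output:

```
# Case 1: enable_thinking=false via chat_template_kwargs.
# The chat template itself appends the empty think block to the prompt,
# so the *generated* text never contains "</think>", and an end-token
# based parser treats the whole generation as reasoning_content.
prompt_tail = "<|im_start|>assistant\n<think>\n\n</think>\n\n"
generation = "The capital of France is Paris."  # no "</think>" here

# Case 2: the /nothink soft switch.
# The model itself emits an empty think block, so "</think>" appears in
# the generation and the parser returns the answer as content.
generation_nothink = "<think>\n\n</think>\n\nThe capital of France is Paris."
```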
Encountered the same problem. But I'm using `vllm serve` to deploy `DeepSeek-R1-AWQ`.

**Environment**

- Image: cuda:12.6.0-cudnn-devel-ubuntu22.04
- GPUs: A800 x 8
- Python 3.10
- vLLM 0.7.2
- torch 2.5.1...
After reading the source code, I have the following findings.

**V0 Implementation**

In the function `get_token_bin_counts_and_mask` (`vllm/model_executor/layers/utils.py`, line 8), the shape of `bin_counts` is `[num_seqs, vocab_size + 1]`, where `num_seqs` is `logits.shape[0]`. The `tokens`...
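For context, here is a paraphrased sketch of what that function does, reconstructed from the shapes described above rather than copied verbatim from vLLM:

```
import torch

def get_token_bin_counts_and_mask(
    tokens: torch.Tensor,   # [num_seqs, max_len], padded with vocab_size
    vocab_size: int,
    num_seqs: int,
) -> tuple[torch.Tensor, torch.Tensor]:
    # The extra column gives padding ids (== vocab_size) somewhere to land.
    bin_counts = torch.zeros((num_seqs, vocab_size + 1),
                             dtype=torch.long,
                             device=tokens.device)
    # Count how many times each token id appears in each sequence.
    bin_counts.scatter_add_(1, tokens, torch.ones_like(tokens))
    # Drop the padding column; the mask marks ids that appeared at least once.
    bin_counts = bin_counts[:, :vocab_size]
    mask = bin_counts > 0
    return bin_counts, mask
```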
I can apply a temporary fix by directly modifying the code to address this error. However, I am uncertain about the correctness of this approach, so this suggestion is provided...
Using https://huggingface.co/ModelCloud/GLM-4.6-REAP-268B-A32B-GPTQMODEL-W4A16 with sglang v0.5.5.post1 on A800 x 4 (`--tp 4`), I got the same issue.

EDIT: I tried `Qwen/Qwen3-30B-A3B-GPTQ-Int4` on A800:

- deployed on a single A800 (no TP): runs successfully.
- ...
After testing, I've made a hotfix by suppressing the error-causing operation:

```
import contextlib  # needed if not already imported in the module

with contextlib.suppress(Exception):
    # If narrowing the shard fails, keep the original weight instead of raising.
    loaded_weight = loaded_weight.narrow(
        shard_dim, shard_size * tp_rank, shard_size
    )
```

This allows normal weight loading under...
I tried downgrading to SGLang v0.5.3, and re-running the command succeeded.

```
sglang==0.5.3
sgl-kernel==0.3.14.post1
```

`--tp 2` and `--tp 2 --quantization moe_wna16` both work without issues. `--tp 2 --ep 2`...
It looks like your vLLM is out of date. Could you upgrade to `vllm==0.6.4.post1` and try generating again?