
[Bug] Speculative decoding not working due to difference in vocab_size (Qwen2.5 series)

Open · glennhanks opened this issue 6 months ago · 1 comment

🐛 Bug

I tried to run Qwen2.5-Math-72B-Instruct-q4f16_1-MLC with Qwen2.5-Math-1.5B-Instruct-q4f16_1-MLC as the draft model, but ran into a problem: strangely, it outputs the first word and then stops. Checking the log, the cause is: TVMError: Check failed: logits->shape[1] == vocab_size_ (151936 vs. 152064)
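If it helps, the two models appear to report different vocabulary sizes (151936 vs. 152064). Below is a minimal sketch of how I compared them, assuming each compiled model directory keeps a top-level "vocab_size" field in its mlc-chat-config.json; the paths are the ones from my serve command further down.

```python
import json
from pathlib import Path

# Paths taken from the serve command in "To Reproduce"; adjust to your layout.
models = {
    "target (72B)": "/mnt/windows/AAA/Qwen2.5-Math-72B-Instruct-q4f16_1-MLC",
    "draft (1.5B)": "/mnt/windows/AAA/Qwen2.5-Math-1.5B-Instruct-q4f16_1-MLC",
}

for name, path in models.items():
    # Assumption: mlc-chat-config.json (generated when the model is compiled)
    # still carries a top-level "vocab_size" entry.
    cfg = json.loads((Path(path) / "mlc-chat-config.json").read_text())
    print(f"{name}: vocab_size = {cfg.get('vocab_size')}")
```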

So I wonder if there is a quick fix. Sorry, I am a layman when it comes to LLM inference and have no idea how to solve this kind of technical problem.

To Reproduce

Steps to reproduce the behavior:

mlc_llm serve /mnt/windows/AAA/Qwen2.5-Math-72B-Instruct-q4f16_1-MLC --mode server --overrides "max_num_sequence=32;max_total_seq_length=4096;tensor_parallel_shards=2" --additional-models /mnt/windows/AAA/Qwen2.5-Math-1.5B-Instruct-q4f16_1-MLC --speculative-mode small_draft
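Then send any chat request; the output stops after the first word. A minimal sketch of the request I use, assuming the server is listening on the default OpenAI-compatible endpoint at 127.0.0.1:8000 and that the model field can be the same path passed to serve:

```python
import requests  # pip install requests

# Assumption: mlc_llm serve uses its default host/port (127.0.0.1:8000)
# and exposes the OpenAI-compatible /v1/chat/completions route.
resp = requests.post(
    "http://127.0.0.1:8000/v1/chat/completions",
    json={
        "model": "/mnt/windows/AAA/Qwen2.5-Math-72B-Instruct-q4f16_1-MLC",
        "messages": [{"role": "user", "content": "What is 12 * 17?"}],
        "max_tokens": 64,
    },
    timeout=300,
)
print(resp.status_code)
print(resp.json())
```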

Expected behavior

Speculative decoding runs with the 1.5B model as the draft and the server returns complete responses, instead of failing the vocab_size check after the first token.

Environment

  • Platform (e.g. WebGPU/Vulkan/IOS/Android/CUDA): ROCM 6.2.2, (GPU: MI50)
  • Operating system (e.g. Ubuntu/Windows/MacOS/...): Ubuntu 22.04
  • Device (e.g. iPhone 12 Pro, PC+RTX 3090, ...): Desktop
  • How you installed MLC-LLM (conda, source): yes
  • How you installed TVM-Unity (pip, source): not sure
  • Python version (e.g. 3.10): Python 3.12.9
  • GPU driver version (if applicable): ROCM 6.2.2
  • CUDA/cuDNN version (if applicable):
  • TVM Unity Hash Tag (python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))", applicable if you compile models):
  • Any other relevant information:

Additional context

glennhanks · Apr 22 '25, 14:04