[Bug] Speculative decoding not working due to difference in vocab_size (Qwen2.5 series)
🐛 Bug
I tried to run Qwen2.5-Math-72B-Instruct-q4f16_1-MLC with Qwen2.5-Math-1.5B-Instruct-q4f16_1-MLC as the draft model, but ran into a problem. Strangely, it outputs the first word and then stops. Checking the log, the cause is: TVMError: Check failed: logits->shape[1] == vocab_size_ (151936 vs. 152064)
So I wonder if there is a quick fix. I am sorry that I am a layman at LLM inference and have no idea how to solve technical problems.
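For reference, the mismatch can be confirmed by comparing the vocab_size recorded in each model's mlc-chat-config.json. This is a minimal sketch, assuming both converted model directories contain that file with a top-level vocab_size field; the paths are the ones from the command below:

```python
import json
from pathlib import Path

# The target (72B) and draft (1.5B) model directories from the serve command.
target_dir = Path("/mnt/windows/AAA/Qwen2.5-Math-72B-Instruct-q4f16_1-MLC")
draft_dir = Path("/mnt/windows/AAA/Qwen2.5-Math-1.5B-Instruct-q4f16_1-MLC")

for name, model_dir in [("target", target_dir), ("draft", draft_dir)]:
    config = json.loads((model_dir / "mlc-chat-config.json").read_text())
    # vocab_size here is what the failing TVM check compares against logits->shape[1].
    print(f"{name}: vocab_size = {config['vocab_size']}")
```

In my case this prints 152064 for the 72B target and 151936 for the 1.5B draft, matching the two numbers in the error.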
To Reproduce
Steps to reproduce the behavior:
mlc_llm serve /mnt/windows/AAA/Qwen2.5-Math-72B-Instruct-q4f16_1-MLC --mode server --overrides "max_num_sequence=32;max_total_seq_length=4096;tensor_parallel_shards=2" --additional-models /mnt/windows/AAA/Qwen2.5-Math-1.5B-Instruct-q4f16_1-MLC --speculative-mode small_draft
Expected behavior
Environment
- Platform (e.g. WebGPU/Vulkan/IOS/Android/CUDA): ROCm 6.2.2 (GPU: MI50)
- Operating system (e.g. Ubuntu/Windows/MacOS/...): Ubuntu 22.04
- Device (e.g. iPhone 12 Pro, PC+RTX 3090, ...): Desktop
- How you installed MLC-LLM (conda, source): yes
- How you installed TVM-Unity (pip, source): not sure
- Python version (e.g. 3.10): Python 3.12.9
- GPU driver version (if applicable): ROCm 6.2.2
- CUDA/cuDNN version (if applicable):
- TVM Unity Hash Tag (python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))", applicable if you compile models):
- Any other relevant information: