Differences Between the GitHub Open-Source Version and the Paper Implementation of DeepSeek-Chat-Lite
Thank you for your work.
May I ask what the differences are between the open-source code on GitHub and the version described in the paper?
I tested DeepSeek-V2-Lite-Chat on BIG-bench, and the latency I measure is around 280 ms, whereas the paper reports 155 ms. I’d like to know where this difference comes from.
The test command is:
CUDA_VISIBLE_DEVICES=0 python examples/interface_example.py --model_name_or_path /app/data/DeepSeek-V2-Lite-Chat --offload_dir /root/moe-infinity --device_memory_ratio 0.75 --out_len 32
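For comparison, here is a minimal sketch of how I estimate per-token decode latency without MoE-Infinity offloading, using plain transformers. This is just a baseline I put together (not the paper's benchmark harness); it assumes a GPU with enough memory to hold the full model in bf16, and the prompt is a placeholder:

```python
# Baseline per-token latency sketch (assumption: full model fits on one GPU in bf16).
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/app/data/DeepSeek-V2-Lite-Chat"  # same path as in the command above
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="cuda:0", trust_remote_code=True
)

prompt = "Explain mixture-of-experts inference in one sentence."  # placeholder prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

torch.cuda.synchronize()
start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False)  # matches --out_len 32
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
print(f"{elapsed / new_tokens * 1000:.1f} ms per generated token")
```

This at least lets me separate the offloading overhead from the base model's decode speed on my hardware.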
Many thanks!
I assume you are running on the main branch; feature/qwen can be faster but is a bit less stable. See also #64.
Thank you for your reply.
I’m using the main branch. Has the FlashInfer feature already been integrated?
If not, I’m looking forward to its integration.
I’m looking forward even more to seeing the moe-inf feature integrated into vLLM.