
What are the differences between the GitHub open-source version and the paper implementation for DeepSeek-Chat-Lite?

Open dnnyyq opened this issue 2 months ago • 2 comments

Thank you for your work.

May I ask what the differences are between the open-source code on GitHub and the version described in the paper?
I tested deepseek-chat-lite on BigBench, and the per-token latency is around 280 ms, whereas the paper reports 155 ms. I'd like to know where the difference comes from.

The test script is:

CUDA_VISIBLE_DEVICES=0 python examples/interface_example.py \
  --model_name_or_path /app/data/DeepSeek-V2-Lite-Chat \
  --offload_dir /root/moe-infinity \
  --device_memory_ratio 0.75 \
  --out_len 32
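For context, the 280 ms and 155 ms figures above are average per-token decode latencies, which fall out of total generation time divided by the number of output tokens (--out_len 32 in the command). A minimal sketch of that arithmetic, with a hypothetical helper name (not part of MoE-Infinity):

```python
def per_token_latency_ms(total_seconds: float, num_generated_tokens: int) -> float:
    """Average decode latency per generated token, in milliseconds.

    Hypothetical helper for illustration; not an API from MoE-Infinity.
    """
    if num_generated_tokens <= 0:
        raise ValueError("need at least one generated token")
    return total_seconds * 1000.0 / num_generated_tokens

# Example: if the 32 output tokens took ~8.96 s in total,
# that works out to the ~280 ms/token mentioned above.
print(per_token_latency_ms(8.96, 32))  # 280.0
```

Comparing this number against the paper's 155 ms only makes sense if the same output length, batch size, and offloading configuration are used.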

Many thanks!

dnnyyq avatar Oct 14 '25 02:10 dnnyyq

I assume you are running on the main branch; the feature/qwen branch can be faster but is a bit less stable. See also #64.

drunkcoding avatar Oct 14 '25 12:10 drunkcoding

Thank you for your reply.
I’m using the main branch. Has the FlashInfer feature already been integrated?
If not, I’m looking forward to its integration.

I’m even more eager to see the moe-inf feature integrated into vLLM.

dnnyyq avatar Oct 15 '25 02:10 dnnyyq