Leyang Xue comments

Results: 16 comments of Leyang Xue

Does it support other DeepSeek models?

We haven't tested yet, but given that Python 3 is backward compatible, it should work. You might need to build the wheel yourself from source using `BUILD_OPS=1 python3 -m build`.

What are the differences between the GitHub open-source version and the paper implementation of DeepSeek-Chat-Lite?

I assume you are running on the main branch; feature/qwen can be faster but is a bit less stable. See also #64.

[Feature Request]How to measure the generation throughput(token/s)?

You have included the prefill time in the decoding throughput, which is not correct; TTFT needs to be excluded. See [StopWatch](https://github.com/EfficientMoE/MoE-Infinity/blob/main/examples/interface_example.py) for an example.
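To make the point about excluding TTFT concrete, here is a minimal sketch of the arithmetic. It does not use the `StopWatch` class from the linked example; `decode_throughput` and its arguments are hypothetical names, and the timestamps are assumed to be collected by the caller (for instance with `time.perf_counter()` each time a token arrives).

```python
from typing import Sequence, Tuple


def decode_throughput(start: float, token_times: Sequence[float]) -> Tuple[float, float]:
    """Return (ttft_seconds, decode_tokens_per_second).

    start: time at which the request was issued.
    token_times: arrival time of each generated token, in order.
    Hypothetical helper for illustration only.
    """
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure decoding throughput")

    # Time to first token: covers prefill plus the first decoding step.
    ttft = token_times[0] - start

    # Decoding time is measured from the first token onward, so prefill is excluded.
    decode_time = token_times[-1] - token_times[0]
    if decode_time <= 0:
        raise ValueError("token timestamps must be increasing")

    # The first token belongs to TTFT, so only the remaining tokens are counted.
    decode_tokens = len(token_times) - 1
    return ttft, decode_tokens / decode_time


# Made-up timestamps: 2 s of prefill, then one token every 0.25 s.
start = 0.0
token_times = [2.0, 2.25, 2.5, 2.75, 3.0]

ttft, tokens_per_s = decode_throughput(start, token_times)

# The naive figure divides all tokens by the total time, prefill included.
naive_tokens_per_s = len(token_times) / (token_times[-1] - start)

print(f"TTFT: {ttft:.2f} s")
print(f"decoding throughput: {tokens_per_s:.2f} token/s")
print(f"naive throughput (prefill included): {naive_tokens_per_s:.2f} token/s")
```

With these made-up numbers the naive figure is well below the decoding throughput, which is the distortion described above; the longer the prompt, the larger the gap.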

[Feature Request]How to measure the generation throughput(token/s)?

On both systems the results seem counter-intuitive to me. For MoE-Infinity, which commit are you building on? There have been some updates recently. Would you help me to...

[Feature Request]How to measure the generation throughput(token/s)?

The predictor is not applied, since the current version in Python has too much overhead; it is better to use the cache only. I am currently working on that. Thanks for the instruction for...

[Feature Request]How to measure the generation throughput(token/s)?

> Has there been any progress on this issue? Thank you.

There is a big gap in kernel implementation compared to SOTA systems like vLLM, SGLang, or Ollama. We are looking...