Leyang Xue
We haven't tested it yet, but given that Python 3 is backward compatible, it should work. You might need to build the wheel yourself from source using `BUILD_OPS=1 python3 -m build`.
I assume you are running on the main branch; feature/qwen can be faster but is a bit less stable. See also #64.
You have included the prefill time in the decoding throughput, which is not correct; TTFT needs to be excluded. See [StopWatch](https://github.com/EfficientMoE/MoE-Infinity/blob/main/examples/interface_example.py) for an example.
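To make the distinction concrete, here is a minimal sketch of measuring TTFT and decoding throughput separately. It is not the StopWatch from the linked example and does not use any MoE-Infinity API; `measure_decode`, `token_stream`, and `clock` are hypothetical names chosen for illustration, and the sketch assumes tokens arrive one at a time from an iterable.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_decode(
    token_stream: Iterable,
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float]:
    """Return (ttft_seconds, decode_tokens_per_second).

    Hypothetical helper: `token_stream` is any iterable that yields one
    item per generated token. The time until the first token (prefill,
    i.e. TTFT) is reported separately and excluded from the throughput.
    """
    start = clock()
    first = last = None
    count = 0
    for _ in token_stream:
        last = clock()          # timestamp of the token just received
        if first is None:
            first = last        # the first token marks the end of prefill
        count += 1
    if first is None:
        raise ValueError("token_stream produced no tokens")

    ttft = first - start
    decode_time = last - first  # time spent after the first token
    decode_tokens = count - 1   # the first token belongs to prefill
    if decode_time > 0:
        throughput = decode_tokens / decode_time
    else:
        throughput = float("nan")  # a single token gives no decode interval
    return ttft, throughput
```

Dividing the total token count by the total wall time instead would fold the prefill latency into the figure and understate decoding throughput, more so for long prompts and short outputs.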
On both systems the results seem counter-intuitive to me. For MoE-Infinity, which commit are you building on? There have been some updates recently. Would you help me to...
The predictor is not applied because the current Python version has too much overhead; it is better to use the cache only. I am currently working on that. Thanks for the instruction for...
> Has there been any progress on this issue? Thank you.

There is a big gap in kernel implementation compared to SOTA like vLLM, SGLang, or Ollama. We are looking...