Support XiaomiMiMo inference with MTP
Motivation
Support XiaomiMiMo inference with MTP (multi-token prediction).
Modifications
Add support for the new XiaomiMiMo model, including inference with MTP.
Checklist
- [x] Format your code according to the Code Formatting with Pre-Commit.
- [ ] Add unit tests as outlined in the Running Unit Tests.
- [ ] Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
- [ ] Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
- [ ] For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
- [ ] Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.
TODO: There are still some minor problems with serving the MTP model. DO NOT MERGE.
Could you add the accuracy test results and a CI test?
For the memory issue, you can refer to https://github.com/sgl-project/sglang/blob/fba8eccd7ebe41bbdbf70ab3b6a2df1835f8b532/python/sglang/srt/model_executor/model_runner.py#L725 and make similar changes.
MTP-related CI test added.
Great work
Could you share the result of python3 -m sglang.test.send_one?
Sure, thanks for pointing out this test. The result is also pasted in the description.
python3 -m sglang.test.send_one
w/ MTP
-------------------------------------------------------
acc_length=1.76
speed=103.04 token/s
-------------------------------------------------------
w/o MTP
-------------------------------------------------------
acc_length=1.00
speed=79.35 token/s
-------------------------------------------------------
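As a rough back-of-envelope reading of the two runs above, the snippet below computes the end-to-end speedup from enabling MTP. It also estimates how much more a speculative step costs than a plain decode step. That second figure rests on an assumption not stated in the thread: that `acc_length` is the mean number of tokens emitted per target-model verification step. Treat it as an illustrative sketch, not a profiling result.

```python
# Numbers copied from the send_one output above (speed in token/s).
with_mtp = {"acc_length": 1.76, "speed": 103.04}
without_mtp = {"acc_length": 1.00, "speed": 79.35}

# End-to-end decode speedup from enabling MTP.
speedup = with_mtp["speed"] / without_mtp["speed"]

# Assumption: acc_length is the mean number of tokens emitted per
# verification step. Under that reading, one MTP step (draft + verify)
# takes about acc_length / speedup times as long as one plain decode step.
step_cost_ratio = with_mtp["acc_length"] / speedup

print(f"speedup: {speedup:.2f}x")
print(f"relative cost per MTP step: {step_cost_ratio:.2f}x")
```

Under this reading, MTP yields roughly a 1.3x throughput gain even though each step emits 1.76 tokens on average. The gap would be the drafting and verification overhead per step.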