[Model][MiniMaxText01] Support MiniMaxText01 model inference
Purpose
This PR adds support for MiniMaxText01 model inference. It can run on a single machine with 8xH800 or 8xH20 GPUs, where a single H800 machine can handle a maximum context input of 2 million tokens and a single H20 machine can handle a maximum context input of 5 million tokens.
Modifications
- Add the MiniMaxText01 model inference implementation, and a separate cache manager specifically for linear attention.
- Adapt to the inputs used by the mamba model, including `request_ids_to_seq_ids` and `finished_requests_ids` (see the sketch after this list).
- Temporary fix for the `finished_requests_ids` issue in consecutive multi-batch inferences: a stopgap for a specific state-management problem that arises across consecutive multi-batch runs.
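For illustration, here is a simplified, hypothetical sketch of how a per-request constant-size cache can consume `request_ids_to_seq_ids` and `finished_requests_ids`. Class, method names, and tensor shapes are illustrative only and do not mirror the actual cache manager added in this PR.

```python
import torch


class ToyLinearAttnCache:
    def __init__(self, max_seqs: int, num_layers: int, head_dim: int):
        # One fixed-size recurrent state per sequence: linear attention keeps a
        # constant-size state instead of a KV cache that grows with context.
        self.states = torch.zeros(num_layers, max_seqs, head_dim, head_dim)
        self.free_slots = list(range(max_seqs))
        self.request_to_slot: dict[str, int] = {}

    def prepare(self, request_ids_to_seq_ids: dict[str, list[int]],
                finished_requests_ids: list[str]) -> list[int]:
        # Release the slots of requests that finished since the last step.
        for req_id in finished_requests_ids:
            slot = self.request_to_slot.pop(req_id, None)
            if slot is not None:
                self.states[:, slot].zero_()
                self.free_slots.append(slot)
        # Map every active request in this batch to a state slot.
        slot_mapping = []
        for req_id in request_ids_to_seq_ids:
            if req_id not in self.request_to_slot:
                self.request_to_slot[req_id] = self.free_slots.pop()
            slot_mapping.append(self.request_to_slot[req_id])
        return slot_mapping
```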
Deployment
Default Parameter Startup
python3 -m vllm.entrypoints.api_server \
--model ${MiniMaxText01-Model-Path} \
--tensor-parallel-size 8 \
--trust-remote-code \
--quantization experts_int8 \
--max_model_len 1000000 \
--dtype bfloat16
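Once the server is up, a quick smoke test might look like the sketch below. It assumes the demo api_server's default `/generate` endpoint on localhost:8000; adjust host, port, and prompt as needed.

```python
import requests

# Send a short completion request to the demo api_server started above.
response = requests.post(
    "http://127.0.0.1:8000/generate",
    json={
        "prompt": "The capital of France is",
        "max_tokens": 32,
        "temperature": 0.0,
    },
)
print(response.json()["text"])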
H800 TP8, maximum context length 2 million
python3 -m vllm.entrypoints.api_server \
--model ${MiniMax-Text-01-Path} \
--tensor-parallel-size 8 \
--trust-remote-code \
--quantization experts_int8 \
--max_model_len 2048000 \
--gpu_memory_utilization 0.95 \
--max_num_seqs 1 \
--dtype bfloat16
H20 TP8, maximum context length 5 million
python -m vllm.entrypoints.api_server \
--model MiniMaxAI/MiniMax-Text-01 \
--tensor-parallel-size 8 \
--trust-remote-code \
--quantization experts_int8 \
--max_model_len 5120000 \
--gpu_memory_utilization 0.95 \
--max_num_seqs 1 \
--dtype bfloat16
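For reference, a roughly equivalent offline setup via the Python API might look like the following sketch; the prompt and length limits are illustrative, and the flags mirror the server commands above.

```python
from vllm import LLM, SamplingParams

# Offline engine with the same parallelism/quantization settings as the
# server examples; tune max_model_len to your hardware.
llm = LLM(
    model="MiniMaxAI/MiniMax-Text-01",
    tensor_parallel_size=8,
    trust_remote_code=True,
    quantization="experts_int8",
    max_model_len=1_000_000,
    gpu_memory_utilization=0.95,
    dtype="bfloat16",
)
outputs = llm.generate(
    ["Summarize the benefits of linear attention in one sentence."],
    SamplingParams(max_tokens=64, temperature=0.0),
)
print(outputs[0].outputs[0].text)
```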
👋 Hi! Thank you for contributing to the vLLM project.
💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run the other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.
Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.
🚀
Why do you introduce minimax_cache.py instead of reusing mamba_cache.py?
Because the internal data structure `self.mamba_cache` in mamba_cache.py is not suitable for the MiniMaxText01 linear attention cache, and that tensor is coupled to the `current_run_tensors` method.
Could you please support the MiniMax VL model as well? I would greatly appreciate it
Sorry, this may be a silly question, but is the model used int8 quantized to achieve 2 million contexts using H800 TP8 inference?
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @ZZBoom.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
@zwc163 Thank you for your attention. We do not have such a plan in the near future.
@zifengdexiatian Two million tokens is not the goal in itself. To run this model on a single machine with 8xH800, you can only use int8 weight-only quantization or lower precision, and two million tokens is the maximum context that fits in that environment.
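As a rough back-of-the-envelope sketch (assuming ~456B total parameters as reported on the model card, and ignoring activation and framework overhead, so the numbers are approximate):

```python
# Why int8 weight-only quantization is needed on 8xH800 (80 GB each).
# Approximation: treats all weights as 1 byte under int8, although
# experts_int8 only quantizes the expert weights.
total_params = 456e9
gpu_mem_gb = 8 * 80

weights_bf16_gb = total_params * 2 / 1e9   # ~912 GB -> does not fit in 640 GB
weights_int8_gb = total_params * 1 / 1e9   # ~456 GB -> fits
cache_budget_gb = gpu_mem_gb * 0.95 - weights_int8_gb  # ~152 GB left for caches

print(f"bf16 weights: {weights_bf16_gb:.0f} GB (total GPU memory: {gpu_mem_gb} GB)")
print(f"int8 weights: {weights_int8_gb:.0f} GB")
print(f"leftover for KV/state caches at 95% utilization: {cache_budget_gb:.0f} GB")
```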
Thanks for the answer. I understand that a single machine can only run the quantized version, and that it can handle a maximum of 2 million tokens of context at a time.
@ZZBoom just checking - are there any blockers on this PR? I plan to review it but it's still marked as draft
Is there any progress?
Can you merge this please?
Hi @tlrmchlsmth Originally, I intended to make changes to the mamba cache. However, I noticed that mamba2 was released in the past two weeks and the mamba cache is already in use there. After some consideration, we didn't want to affect the work of other teams, so we didn't modify their code. Nevertheless, we still extracted the common module "vllm/model_executor/models/constant_size_cache.py".
@tlrmchlsmth Hi. I have fixed the code and replied to all comments. If you have time, could you review the code again? Thanks.
This is a high-impact PR that could really benefit the project. Hoping it can be merged soon—thanks to everyone involved!
@tlrmchlsmth Hi. If you have time, could you help us review this code again?
Some gsm8k evals on my end. Do these look good to you @qscqesze and @ZZBoom? (Using experts_int8 to fit on a single 8xA100 machine)
Running the following:
vllm serve MiniMaxAI/MiniMax-Text-01 \
--tensor-parallel-size 8 \
--trust-remote-code \
--quantization experts_int8 \
--max_model_len 1000000 \
--dtype bfloat16
lm_eval --model local-completions --tasks gsm8k --model_args model=MiniMaxAI/MiniMax-Text-01,base_url=http://127.0.0.1:8000/v1/completions --limit 100
|Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ | 0.94|± |0.0239|
| | |strict-match | 5|exact_match|↑ | 0.94|± |0.0239|
GSM8K results reported in https://huggingface.co/MiniMaxAI/MiniMax-Text-01#3-evaluation are 0.948, so this looks good to me, especially since we expect to lose a bit of accuracy from quantization.
Adding the ready label to see how the mamba and hybrid integration tests do.
I had a couple more small questions and comments, but overall I think the PR is looking pretty good and ready to land once those are addressed.
Will there be a followup to simplify the weight loading?
Yes. We will simplify the weight loading in follow-up work.
Some gsm8k evals on my end. Do these look good to you @qscqesze and @ZZBoom? (Using experts_int8 to fit on a single 8xA100 machine)
Yeah, this looks good to me and aligns with expectations.
@tlrmchlsmth Hi! I believe our code passes all the tests except for [buildkite/ci/pr/v1-test], which failed due to a torch.OutOfMemoryError: CUDA out of memory. This issue doesn’t seem related to our code. Could you take a look and see if it’s ready to be merged?
Hi @tlrmchlsmth . I’ve fixed the comments—thank you for the feedback! However, the test failed due to a missing image. Would you mind helping to restart the test? When you have a moment, could you also take another look at the code to see if it’s ready to be merged? Thanks again!
I'll take another look at the code tomorrow morning! In the meantime I think you need to merge in main for the failing docker-build-image test (related to #14549)
Thanks. I updated the branch already.