[Model][MiniMaxText01] Support MiniMaxText01 model inference
Purpose
This PR adds support for MiniMaxText01 model inference. It can run on a single machine with 8xH800 or 8xH20 GPUs, where a single H800 machine can handle a maximum context input of 2 million tokens and a single H20 machine can handle a maximum context input of 5 million tokens.
Modifications
- Add the MiniMaxText01 model inference implementation, and a separate cache manager specifically for linear attention.
- Adapt to the inputs used by the mamba model, including `request_ids_to_seq_ids` and `finished_requests_ids` (see the sketch after this list).
- Temporary fix for the `finished_requests_ids` issue in consecutive multi-batch inferences: a stopgap for a specific state-management problem that arises across consecutive multi-batch runs.
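For illustration, here is a simplified, hypothetical sketch of how a per-request constant-size cache can consume `request_ids_to_seq_ids` and `finished_requests_ids`. Class, method names, and tensor shapes are illustrative only and do not mirror the actual cache manager added in this PR.

```python
import torch


class ToyLinearAttnCache:
    def __init__(self, max_seqs: int, num_layers: int, head_dim: int):
        # One fixed-size recurrent state per sequence: linear attention keeps a
        # constant-size state instead of a KV cache that grows with context.
        self.states = torch.zeros(num_layers, max_seqs, head_dim, head_dim)
        self.free_slots = list(range(max_seqs))
        self.request_to_slot: dict[str, int] = {}

    def prepare(self, request_ids_to_seq_ids: dict[str, list[int]],
                finished_requests_ids: list[str]) -> list[int]:
        # Release the slots of requests that finished since the last step.
        for req_id in finished_requests_ids:
            slot = self.request_to_slot.pop(req_id, None)
            if slot is not None:
                self.states[:, slot].zero_()
                self.free_slots.append(slot)
        # Map every active request in this batch to a state slot.
        slot_mapping = []
        for req_id in request_ids_to_seq_ids:
            if req_id not in self.request_to_slot:
                self.request_to_slot[req_id] = self.free_slots.pop()
            slot_mapping.append(self.request_to_slot[req_id])
        return slot_mapping
```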
Deployment
Default Parameter Startup
python3 -m vllm.entrypoints.api_server \
--model ${MiniMaxText01-Model-Path} \
--tensor-parallel-size 8 \
--trust-remote-code \
--quantization experts_int8 \
--max_model_len 1000000 \
--dtype bfloat16
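Once the server is up, a quick smoke test might look like the sketch below. It assumes the demo api_server's default `/generate` endpoint on localhost:8000; adjust host, port, and prompt as needed.

```python
import requests

# Send a short completion request to the demo api_server started above.
response = requests.post(
    "http://127.0.0.1:8000/generate",
    json={
        "prompt": "The capital of France is",
        "max_tokens": 32,
        "temperature": 0.0,
    },
)
print(response.json()["text"])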
H800 TP8, maximum context length 2 million
python3 -m vllm.entrypoints.api_server \
--model ${MiniMax-Text-01-Path} \
--tensor-parallel-size 8 \
--trust-remote-code \
--quantization experts_int8 \
--max_model_len 2048000 \
--gpu_memory_utilization 0.95 \
--max_num_seqs 1 \
--dtype bfloat16
H20 TP8, maximum context length 5 million
python -m vllm.entrypoints.api_server \
--model MiniMaxAI/MiniMax-Text-01 \
--tensor-parallel-size 8 \
--trust-remote-code \
--quantization experts_int8 \
--max_model_len 5120000 \
--gpu_memory_utilization 0.95 \
--max_num_seqs 1 \
--dtype bfloat16
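For reference, a roughly equivalent offline setup via the Python API might look like the following sketch; the prompt and length limits are illustrative, and the flags mirror the server commands above.

```python
from vllm import LLM, SamplingParams

# Offline engine with the same parallelism/quantization settings as the
# server examples; tune max_model_len to your hardware.
llm = LLM(
    model="MiniMaxAI/MiniMax-Text-01",
    tensor_parallel_size=8,
    trust_remote_code=True,
    quantization="experts_int8",
    max_model_len=1_000_000,
    gpu_memory_utilization=0.95,
    dtype="bfloat16",
)
outputs = llm.generate(
    ["Summarize the benefits of linear attention in one sentence."],
    SamplingParams(max_tokens=64, temperature=0.0),
)
print(outputs[0].outputs[0].text)
```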
👋 Hi! Thank you for contributing to the vLLM project.
💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run the other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.
Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.
🚀
Why do you introduce minimax_cache.py instead of reusing mamba_cache.py?
Because the internal data structure `self.mamba_cache` in mamba_cache.py is not suitable for the MiniMaxText01 linear attention cache, and that tensor is coupled to the `current_run_tensors` method.
Could you please support the MiniMax VL model as well? I would greatly appreciate it
Sorry, this may be a silly question, but is the model used int8 quantized to achieve 2 million contexts using H800 TP8 inference?
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @ZZBoom.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
@zwc163 Thank you for your attention. We do not have such a plan in the near future.
@zifengdexiatian Two million tokens is not the goal in itself. To run this model on a single machine with 8xH800, you can only use int8 weight-only quantization or lower precision, and two million tokens is the maximum context that fits in that environment.
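As a rough back-of-the-envelope sketch (assuming ~456B total parameters as reported on the model card, and ignoring activation and framework overhead, so the numbers are approximate):

```python
# Why int8 weight-only quantization is needed on 8xH800 (80 GB each).
# Approximation: treats all weights as 1 byte under int8, although
# experts_int8 only quantizes the expert weights.
total_params = 456e9
gpu_mem_gb = 8 * 80

weights_bf16_gb = total_params * 2 / 1e9   # ~912 GB -> does not fit in 640 GB
weights_int8_gb = total_params * 1 / 1e9   # ~456 GB -> fits
cache_budget_gb = gpu_mem_gb * 0.95 - weights_int8_gb  # ~152 GB left for caches

print(f"bf16 weights: {weights_bf16_gb:.0f} GB (total GPU memory: {gpu_mem_gb} GB)")
print(f"int8 weights: {weights_int8_gb:.0f} GB")
print(f"leftover for KV/state caches at 95% utilization: {cache_budget_gb:.0f} GB")
```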
Thanks for the answer. I understand that a single machine can only run the quantized version, and that it can handle a maximum of 2 million tokens of context at a time.
@ZZBoom just checking - are there any blockers on this PR? I plan to review it but it's still marked as draft
Is there any progress?
Can you merge this please?
Hi @tlrmchlsmth Originally, I intended to make changes to the mamba cache. However, I noticed that mamba2 was released in the past two weeks and the mamba cache is already in use there. After some consideration, we didn't want to affect the work of other teams, so we didn't modify their code. Nevertheless, we still extracted the common module "vllm/model_executor/models/constant_size_cache.py".
@tlrmchlsmth Hi. I have fixed the code and replied to all comments. If you have time, could you review the code again? Thanks.
This is a high-impact PR that could really benefit the project. Hoping it can be merged soon—thanks to everyone involved!
@tlrmchlsmth Hi. If you have time, could you help us review this code again?
Some gsm8k evals on my end. Do these look good to you @qscqesze and @ZZBoom? (Using experts_int8 to fit on a single 8xA100 machine)
Running the following:
vllm serve MiniMaxAI/MiniMax-Text-01 \
--tensor-parallel-size 8 \
--trust-remote-code \
--quantization experts_int8 \
--max_model_len 1000000 \
--dtype bfloat16
lm_eval --model local-completions --tasks gsm8k --model_args model=MiniMaxAI/MiniMax-Text-01,base_url=http://127.0.0.1:8000/v1/completions --limit 100
|Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ | 0.94|± |0.0239|
| | |strict-match | 5|exact_match|↑ | 0.94|± |0.0239|
GSM8K results reported in https://huggingface.co/MiniMaxAI/MiniMax-Text-01#3-evaluation are 0.948, so this looks good to me, especially since we expect to lose a bit of accuracy from quantization.
Adding the ready label to see how the mamba and hybrid integration tests do.
I had a couple more small questions and comments, but overall I think the PR is looking pretty good and ready to land once those are addressed.
Will there be a followup to simplify the weight loading?
Yes. We will simplify the weight loading in follow-up work.
Some gsm8k evals on my end. Do these look good to you @qscqesze and @ZZBoom? (Using experts_int8 to fit on a single 8xA100 machine)
Yeah, this looks good to me and aligns with expectations.
@tlrmchlsmth Hi! I believe our code passes all the tests except for [buildkite/ci/pr/v1-test], which failed due to a torch.OutOfMemoryError: CUDA out of memory. This issue doesn’t seem related to our code. Could you take a look and see if it’s ready to be merged?
Hi @tlrmchlsmth . I’ve fixed the comments—thank you for the feedback! However, the test failed due to a missing image. Would you mind helping to restart the test? When you have a moment, could you also take another look at the code to see if it’s ready to be merged? Thanks again!
I'll take another look at the code tomorrow morning! In the meantime I think you need to merge in main for the failing docker-build-image test (related to #14549)
Thanks. I updated the branch already.