# Feature Request: MiniMax-Text-01 model
### Prerequisites
- [x] I am running the latest code. Mention the version if possible as well.
- [x] I carefully followed the README.md.
- [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- [x] I reviewed the Discussions, and have a new and useful enhancement to share.
### Feature Description
Please add support for the MiniMax-Text-01 model: https://huggingface.co/MiniMaxAI/MiniMax-Text-01 https://github.com/MiniMax-AI/MiniMax-01
### Motivation
We need to add support for the latest models! It performs almost as well as DeepSeek V3, but has a 4-million-token context window.
### Possible Implementation
It's a MoE model.
Very interested in this model!
I have something more or less working here: https://github.com/fairydreaming/llama.cpp/tree/minimax-text-01
Some major remaining problems:
- It currently doesn't support multiple token sequences. My current implementation of lightning attention simply ignores the token positions and sequence ids. Inference of a single token sequence should work fine.
- I guess that proper support of this model would require some redesign of the KV cache. The problem is that layers in MiniMax-Text-01 use either linear lightning attention (a single kv state matrix per layer is cached in this case) or regular transformer attention that caches separate key and value vectors per token. I need to think about it some more; a rough sketch of the two cache shapes is below.
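To make the cache problem a bit more concrete, here is a minimal sketch of the two per-layer cache shapes (hypothetical types, not the actual llama.cpp structures): the lightning-attention layers fold the whole history into a fixed-size per-head state matrix, while the regular attention layers keep growing K/V buffers.

```cpp
// Hypothetical sketch of the two per-layer cache shapes in MiniMax-Text-01.
// These are NOT the real llama.cpp structures, just an illustration of the
// problem: one cache grows with the context, the other has a fixed size.
#include <cstdint>
#include <vector>

// Softmax-attention layer: the cache grows by one K and one V vector per
// token, so its memory is O(n_ctx * n_head_kv * d_head).
struct softmax_kv_cache {
    int64_t n_head_kv;
    int64_t d_head;
    std::vector<float> k; // [n_tokens, n_head_kv, d_head], appended per token
    std::vector<float> v; // [n_tokens, n_head_kv, d_head], appended per token
};

// Lightning-(linear-)attention layer: the whole history is folded into a
// per-head state matrix kv = sum_t k_t * v_t^T, so its memory is
// O(n_head * d_head * d_head) regardless of the context length.
struct lightning_state_cache {
    int64_t n_head;
    int64_t d_head;
    std::vector<float> kv; // [n_head, d_head, d_head], updated in place per step
};

// A hybrid model needs one or the other per layer, which is what the current
// single-format KV cache was not designed for.
struct layer_cache {
    bool is_lightning;
    softmax_kv_cache      attn; // used when !is_lightning
    lightning_state_cache lin;  // used when is_lightning
};
```

With layers of both kinds in one model, a single uniform KV cache layout doesn't fit either of them well.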
I tested it on CPU (AMD EPYC 9374F, Q5_K_M); some token generation performance values:
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| minimax01 456B Q5_K - Medium | 302.51 GiB | 456.09 B | CPU | 32 | tg32@pp128 | 4.88 ± 0.05 |
| minimax01 456B Q5_K - Medium | 302.51 GiB | 456.09 B | CPU | 32 | tg32@pp256 | 4.51 ± 0.00 |
| minimax01 456B Q5_K - Medium | 302.51 GiB | 456.09 B | CPU | 32 | tg32@pp512 | 4.50 ± 0.00 |
| minimax01 456B Q5_K - Medium | 302.51 GiB | 456.09 B | CPU | 32 | tg32@pp1024 | 4.48 ± 0.00 |
| minimax01 456B Q5_K - Medium | 302.51 GiB | 456.09 B | CPU | 32 | tg32@pp2048 | 4.42 ± 0.00 |
| minimax01 456B Q5_K - Medium | 302.51 GiB | 456.09 B | CPU | 32 | tg32@pp4096 | 4.34 ± 0.00 |
| minimax01 456B Q5_K - Medium | 302.51 GiB | 456.09 B | CPU | 32 | tg32@pp8192 | 4.18 ± 0.00 |
I used my custom llama-bench test for measuring the token generation rate at a given prompt length (tg32@ppN means: process an N-token prompt, then generate 32 tokens and report the generation speed).
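For clarity, this is roughly what a tg32@ppN run measures; `process_prompt` and `decode_one_token` are placeholder callbacks standing in for the real llama.cpp batch/decode calls, not the actual API:

```cpp
// Sketch of a tg32@ppN measurement: process an N-token prompt first (prefill),
// then time the one-by-one generation of 32 tokens. The two callbacks are
// placeholders, not the real llama.cpp functions.
#include <chrono>
#include <functional>

double measure_tg_at_pp(int n_prompt, int n_gen,
                        const std::function<void(int)> & process_prompt,
                        const std::function<void()>    & decode_one_token) {
    process_prompt(n_prompt); // prefill phase, not included in the timing

    const auto t_start = std::chrono::steady_clock::now();
    for (int i = 0; i < n_gen; ++i) {
        decode_one_token();   // generation phase, timed
    }
    const auto t_end = std::chrono::steady_clock::now();

    const double seconds = std::chrono::duration<double>(t_end - t_start).count();
    return n_gen / seconds;   // tokens/s, e.g. tg32@pp8192 -> n_prompt = 8192, n_gen = 32
}
```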
> I guess that proper support of this model would require some redesign of the KV cache. The problem is that layers in MiniMax-Text-01 use either linear lightning attention (a single kv state matrix per layer is cached in this case) or regular transformer attention that caches separate key and value vectors per token. I need to think about it some more.
Yup, it's unfeasible to keep trying to fit all variants of attention into the existing KV cache code. I am hoping that after the refactoring in #11213, we will be able to implement custom attention mechanisms for use cases like these.
I noticed a problem with the model "eating" some words when asked to repeat text (Q5_K_M quant). Can someone with more RAM (like 512 GB or 1 TB) test this model with my branch? I'm not sure whether the model is very sensitive to quantization or there is some other problem. The full prompt is:
<beginning_of_sentence>user name=user
Repeat this text: "The different accidents of life are not so changeable as the feelings of human nature. I had worked hard for nearly two years, for the sole purpose of infusing life into an inanimate body. For this I had deprived myself of rest and health. I had desired it with an ardour that far exceeded moderation; but now that I had finished, the beauty of the dream vanished, and breathless horror and disgust filled my heart."<end_of_sentence>
<beginning_of_sentence>ai name=assistant
while the model answer is:
The different accidents of life are not so changeable as the feelings human nature. I had worked hard for nearly two years, for the sole purpose of infusing life into an inanimate body. For this I had deprived myself rest and health. I had desired it with an ardour that far exceeded moderation; but now that I had finished, the beauty of the dream vanished, and breathless horror and disgust filled my heart.<end_of_sentence>
There is one missing "of" in front of "human nature" and another "of" in front of "rest and health". Sometimes it eats an "and" instead, or both. A hungry model. I ran it with temp 0.01.
I'm curious if it happens also on f16 or Q8_0 quantization.
I have 1 TB of RAM, I can try it.
I found out about llama_sbatch::split_equal, so my branch now supports inference of multiple token sequences with llama-server. Prompt caching should be disabled for now; it doesn't work correctly. Run the server with --jinja to use the model's prompt template.
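As far as I understand it, split_equal builds micro-batches in which every participating sequence contributes the same number of tokens, which is what the recurrent lightning-attention state update needs. A toy illustration of that idea (hypothetical types and logic, not the real llama_sbatch implementation):

```cpp
// Toy illustration of the "equal split" idea used for recurrent-style layers:
// from a batch holding tokens of several sequences, build micro-batches in
// which every remaining sequence contributes the same number of tokens.
// Hypothetical types and logic, not the actual llama_sbatch implementation.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <map>
#include <vector>

struct toy_ubatch {
    // per sequence id: the tokens taken from that sequence for this micro-batch
    std::map<int32_t, std::vector<int32_t>> tokens_per_seq;
};

std::vector<toy_ubatch> split_equal_toy(std::map<int32_t, std::vector<int32_t>> pending) {
    std::vector<toy_ubatch> ubatches;
    while (!pending.empty()) {
        // take the same number of tokens from every remaining sequence:
        // the length of the shortest one
        size_t n_take = SIZE_MAX;
        for (const auto & [seq_id, toks] : pending) {
            n_take = std::min(n_take, toks.size());
        }

        toy_ubatch ub;
        for (auto it = pending.begin(); it != pending.end(); ) {
            auto & toks = it->second;
            ub.tokens_per_seq[it->first].assign(toks.begin(), toks.begin() + n_take);
            toks.erase(toks.begin(), toks.begin() + n_take);
            it = toks.empty() ? pending.erase(it) : std::next(it);
        }
        ubatches.push_back(std::move(ub));
    }
    return ubatches;
}
```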
@fairydreaming I tested your branch with Q5_K_M. On my setup I also see some missing "of". Tested on an AMD EPYC with 768 GB RAM. Can you share the full command you used to run the test? Building Q8_0 now and will test tomorrow...
build: 4532 (1e74c4d9) with gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
Q5_K_M:
> <beginning_of_sentence>user name=user
Repeat this text: "The different accidents of life are not so changeable as the feelings of human nature. I had worked hard for nearly two years, for the sole purpose of infusing life into an inanimate body. For this I had deprived myself of rest and health. I had desired it with an ardour that far exceeded moderation; but now that I had finished, the beauty of the dream vanished, and breathless horror and disgust filled my heart."<end_of_sentence>
<beginning_of_sentence>ai name=assistant
"The different accidents life are not so changeable as the feelings human nature. I had worked hard for nearly two years, for the sole purpose of infusing life an inanimate body. For I had deprived myself of rest and health. I had desired it with an ardour that far exceeded moderation; but now that I had finished the beauty of the dream vanished, and breathless horror and disgust filled my heart."
> @fairydreaming I tested your branch with Q5_K_M. On my setup I also see some missing "of". Tested on an AMD EPYC with 768 GB RAM. Can you share the full command you used to run the test? Building Q8_0 now and will test tomorrow...
That would be helpful, thanks. Regarding the command line: I can't access the workstation now, I will add that later.
file format = GGUF V3 (latest) file type = Q8_0 file size = 451.36 GiB (8.50 BPW)
Full log:
Compared by ChatGPT:
Summary of Rounds and Missing Words
Across the four rounds, the text provided by the user was analyzed for differences in word usage. Here's a concise summary of the missing words in each round and how they evolved:
Round 1:
Missing Words:
- "of" (in "The different accidents life are not so changeable as the feelings of human nature").
- "of" (in "For this I had deprived myself rest health").
- "and" (in "For this I had deprived myself rest health").
- "of" (in "the beauty the dream vanished").
- "and" (in "the beauty the dream vanished breathless horror").
Round 2:
Missing Words:
- "of" (in "as the feelings human nature").
- "of" (in "for the sole purpose infusing life").
Round 3:
- No Missing Words: The AI response matched the original text completely.
Round 4:
- No Missing Words: The AI response was identical to the original text.
Summary of All Missing Words:
From Rounds 1 and 2, the following words were missing:
- "of" (five occurrences in total across both rounds).
- "and" (two occurrences in Round 1).
In Rounds 3 and 4, no words were missing, indicating that the AI eventually reproduced the original text without errors.
@fairydreaming I found a possible issue with that, need to reconvert the model again. See you soon.
Still the same issue. I removed ignore_merges from llama-vocab.cpp and did the conversion and quantization again, but no success. "of" and "and" are still missing.
https://github.com/Nondzu/llama.cpp/commit/9ec337849645604062f84ba31610ac1001d3dcb8
log.txt
@Nondzu OK, if it happens on Q8_0 then there's likely still some problem with my inference code, as I didn't observe this behavior via the OpenRouter API. Thanks for testing!
This issue was closed because it has been inactive for 14 days since being marked as stale.
rip
Does this mean that MiniMax-Text-01 will never come to llama.cpp?
@ehartford maybe someone needs to open a new issue
Hi, how is the progress on this model going? I noticed there seemed to be some issues earlier, and I tried to reproduce them on my side. When I deployed the model without any quantization or precision reduction, it produced the expected results.
Is there anything I can help with? I’d like to contribute by integrating this model into llama.cpp. From what I understand, quantization currently leads to incorrect outputs—is that right?
I will continue this work in PR #13889.