# Feature Request: MiniMax-Text-01 model
### Prerequisites
- [x] I am running the latest code. Mention the version if possible as well.
- [x] I carefully followed the README.md.
- [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- [x] I reviewed the Discussions, and have a new and useful enhancement to share.
### Feature Description
Please add support for the MiniMax-Text-01 model: https://huggingface.co/MiniMaxAI/MiniMax-Text-01 https://github.com/MiniMax-AI/MiniMax-01
### Motivation
We need to add support for the latest models! It performs almost as well as DeepSeek V3, but has a 4-million-token context window.
### Possible Implementation
It's a MoE model.
Very interested in this model!
I have something more or less working here: https://github.com/fairydreaming/llama.cpp/tree/minimax-text-01
Some major remaining problems:
- It currently doesn't support multiple token sequences. My current implementation of lightning attention simply ignores the token positions and sequence ids. Inference of a single token sequence should work fine.
- I guess that proper support of this model would require some redesign of the KV cache. The problem is that layers in MiniMax-Text-01 use either linear lightning attention (a single kv state matrix per layer is cached in this case) or regular transformer attention that caches separate key and value vectors per token. I need to think about it some more; a rough sketch of the two cache shapes is below.
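To make the cache problem a bit more concrete, here is a minimal sketch of the two per-layer cache shapes (hypothetical types, not the actual llama.cpp structures): the lightning-attention layers fold the whole history into a fixed-size per-head state matrix, while the regular attention layers keep growing K/V buffers.

```cpp
// Hypothetical sketch of the two per-layer cache shapes in MiniMax-Text-01.
// These are NOT the real llama.cpp structures, just an illustration of the
// problem: one cache grows with the context, the other has a fixed size.
#include <cstdint>
#include <vector>

// Softmax-attention layer: the cache grows by one K and one V vector per
// token, so its memory is O(n_ctx * n_head_kv * d_head).
struct softmax_kv_cache {
    int64_t n_head_kv;
    int64_t d_head;
    std::vector<float> k; // [n_tokens, n_head_kv, d_head], appended per token
    std::vector<float> v; // [n_tokens, n_head_kv, d_head], appended per token
};

// Lightning-(linear-)attention layer: the whole history is folded into a
// per-head state matrix kv = sum_t k_t * v_t^T, so its memory is
// O(n_head * d_head * d_head) regardless of the context length.
struct lightning_state_cache {
    int64_t n_head;
    int64_t d_head;
    std::vector<float> kv; // [n_head, d_head, d_head], updated in place per step
};

// A hybrid model needs one or the other per layer, which is what the current
// single-format KV cache was not designed for.
struct layer_cache {
    bool is_lightning;
    softmax_kv_cache      attn; // used when !is_lightning
    lightning_state_cache lin;  // used when is_lightning
};
```

With layers of both kinds in one model, a single uniform KV cache layout doesn't fit either of them well.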
I tested it on CPU (AMD EPYC 9374F, Q5_K_M); some token generation performance values:
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| minimax01 456B Q5_K - Medium | 302.51 GiB | 456.09 B | CPU | 32 | tg32@pp128 | 4.88 ± 0.05 |
| minimax01 456B Q5_K - Medium | 302.51 GiB | 456.09 B | CPU | 32 | tg32@pp256 | 4.51 ± 0.00 |
| minimax01 456B Q5_K - Medium | 302.51 GiB | 456.09 B | CPU | 32 | tg32@pp512 | 4.50 ± 0.00 |
| minimax01 456B Q5_K - Medium | 302.51 GiB | 456.09 B | CPU | 32 | tg32@pp1024 | 4.48 ± 0.00 |
| minimax01 456B Q5_K - Medium | 302.51 GiB | 456.09 B | CPU | 32 | tg32@pp2048 | 4.42 ± 0.00 |
| minimax01 456B Q5_K - Medium | 302.51 GiB | 456.09 B | CPU | 32 | tg32@pp4096 | 4.34 ± 0.00 |
| minimax01 456B Q5_K - Medium | 302.51 GiB | 456.09 B | CPU | 32 | tg32@pp8192 | 4.18 ± 0.00 |
I used my custom llama-bench test for measuring the token generation rate at a given prompt length (tg32@ppN means: process an N-token prompt, then generate 32 tokens and report the generation speed).
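For clarity, this is roughly what a tg32@ppN run measures; `process_prompt` and `decode_one_token` are placeholder callbacks standing in for the real llama.cpp batch/decode calls, not the actual API:

```cpp
// Sketch of a tg32@ppN measurement: process an N-token prompt first (prefill),
// then time the one-by-one generation of 32 tokens. The two callbacks are
// placeholders, not the real llama.cpp functions.
#include <chrono>
#include <functional>

double measure_tg_at_pp(int n_prompt, int n_gen,
                        const std::function<void(int)> & process_prompt,
                        const std::function<void()>    & decode_one_token) {
    process_prompt(n_prompt); // prefill phase, not included in the timing

    const auto t_start = std::chrono::steady_clock::now();
    for (int i = 0; i < n_gen; ++i) {
        decode_one_token();   // generation phase, timed
    }
    const auto t_end = std::chrono::steady_clock::now();

    const double seconds = std::chrono::duration<double>(t_end - t_start).count();
    return n_gen / seconds;   // tokens/s, e.g. tg32@pp8192 -> n_prompt = 8192, n_gen = 32
}
```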
> I guess that proper support of this model would require some redesign of the KV cache. The problem is that layers in MiniMax-Text-01 use either linear lightning attention (a single kv state matrix per layer is cached in this case) or regular transformer attention that caches separate key and value vectors per token. I need to think about it some more.
Yup, it's unfeasible to keep trying to fit all variants of attention into the existing KV cache code. I am hoping that after the refactoring in #11213, we will be able to implement custom attention mechanisms for use cases like these.
I noticed a problem with the model "eating" some words when asked to repeat text (Q5_K_M quant). Can someone with more RAM (like 512 GB or 1 TB) test this model with my branch? I'm not sure whether the model is very sensitive to quantization or there is some other problem. The full prompt is:
<beginning_of_sentence>user name=user
Repeat this text: "The different accidents of life are not so changeable as the feelings of human nature. I had worked hard for nearly two years, for the sole purpose of infusing life into an inanimate body. For this I had deprived myself of rest and health. I had desired it with an ardour that far exceeded moderation; but now that I had finished, the beauty of the dream vanished, and breathless horror and disgust filled my heart."<end_of_sentence>
<beginning_of_sentence>ai name=assistant
while the model answer is:
The different accidents of life are not so changeable as the feelings human nature. I had worked hard for nearly two years, for the sole purpose of infusing life into an inanimate body. For this I had deprived myself rest and health. I had desired it with an ardour that far exceeded moderation; but now that I had finished, the beauty of the dream vanished, and breathless horror and disgust filled my heart.<end_of_sentence>
There is one missing "of" in front of "human nature" and another "of" in front of "rest and health". Sometimes it eats an "and" instead, or both. A hungry model. I ran it with temp 0.01.
I'm curious if it happens also on f16 or Q8_0 quantization.
I have 1 TB of RAM, I can try it.
I found out about llama_sbatch::split_equal, so my branch now supports inference of multiple token sequences with llama-server. Prompt caching should be disabled for now; it doesn't work correctly. Run the server with --jinja to use the model's prompt template.
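As far as I understand it, split_equal builds micro-batches in which every participating sequence contributes the same number of tokens, which is what the recurrent lightning-attention state update needs. A toy illustration of that idea (hypothetical types and logic, not the real llama_sbatch implementation):

```cpp
// Toy illustration of the "equal split" idea used for recurrent-style layers:
// from a batch holding tokens of several sequences, build micro-batches in
// which every remaining sequence contributes the same number of tokens.
// Hypothetical types and logic, not the actual llama_sbatch implementation.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <map>
#include <vector>

struct toy_ubatch {
    // per sequence id: the tokens taken from that sequence for this micro-batch
    std::map<int32_t, std::vector<int32_t>> tokens_per_seq;
};

std::vector<toy_ubatch> split_equal_toy(std::map<int32_t, std::vector<int32_t>> pending) {
    std::vector<toy_ubatch> ubatches;
    while (!pending.empty()) {
        // take the same number of tokens from every remaining sequence:
        // the length of the shortest one
        size_t n_take = SIZE_MAX;
        for (const auto & [seq_id, toks] : pending) {
            n_take = std::min(n_take, toks.size());
        }

        toy_ubatch ub;
        for (auto it = pending.begin(); it != pending.end(); ) {
            auto & toks = it->second;
            ub.tokens_per_seq[it->first].assign(toks.begin(), toks.begin() + n_take);
            toks.erase(toks.begin(), toks.begin() + n_take);
            it = toks.empty() ? pending.erase(it) : std::next(it);
        }
        ubatches.push_back(std::move(ub));
    }
    return ubatches;
}
```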
@fairydreaming I tested your branch with Q5_K_M. On my setup I also see some missing "of". Tested on an AMD EPYC with 768 GB RAM. Can you share the full command you used to run the test? Building Q8_0 now and will test tomorrow...
build: 4532 (1e74c4d9) with gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
Q5_K_M:
> <beginning_of_sentence>user name=user
Repeat this text: "The different accidents of life are not so changeable as the feelings of human nature. I had worked hard for nearly two years, for the sole purpose of infusing life into an inanimate body. For this I had deprived myself of rest and health. I had desired it with an ardour that far exceeded moderation; but now that I had finished, the beauty of the dream vanished, and breathless horror and disgust filled my heart."<end_of_sentence>
<beginning_of_sentence>ai name=assistant
"The different accidents life are not so changeable as the feelings human nature. I had worked hard for nearly two years, for the sole purpose of infusing life an inanimate body. For I had deprived myself of rest and health. I had desired it with an ardour that far exceeded moderation; but now that I had finished the beauty of the dream vanished, and breathless horror and disgust filled my heart."
> @fairydreaming I tested your branch with Q5_K_M. On my setup I also see some missing "of". Tested on an AMD EPYC with 768 GB RAM. Can you share the full command you used to run the test? Building Q8_0 now and will test tomorrow...
That would be helpful, thanks. Regarding the command line: I can't access the workstation now, I will add that later.
file format = GGUF V3 (latest) file type = Q8_0 file size = 451.36 GiB (8.50 BPW)
Full log:
Compared by ChatGPT:
Summary of Rounds and Missing Words
Across the four rounds, the text provided by the user was analyzed for differences in word usage. Here's a concise summary of the missing words in each round and how they evolved:
Round 1:
Missing Words:
- "of" (in "The different accidents life are not so changeable as the feelings of human nature").
- "of" (in "For this I had deprived myself rest health").
- "and" (in "For this I had deprived myself rest health").
- "of" (in "the beauty the dream vanished").
- "and" (in "the beauty the dream vanished breathless horror").
Round 2:
Missing Words:
- "of" (in "as the feelings human nature").
- "of" (in "for the sole purpose infusing life").
Round 3:
- No Missing Words: The AI response matched the original text completely.
Round 4:
- No Missing Words: The AI response was identical to the original text.
Summary of All Missing Words:
From Rounds 1 and 2, the following words were missing:
- "of" (five occurrences in total across both rounds).
- "and" (two occurrences in Round 1).
In Rounds 3 and 4, no words were missing, indicating that the AI eventually reproduced the original text without errors.
@fairydreaming I found a possible issue with that, need to reconvert the model again. See you soon.
Still the same issue. I removed ignore_merges from llama-vocab.cpp and did the conversion and quantization again, but no success. "of" and "and" are still missing.
https://github.com/Nondzu/llama.cpp/commit/9ec337849645604062f84ba31610ac1001d3dcb8
log.txt
@Nondzu OK, if it happens on Q8_0 then there's likely still some problem with my inference code, as I didn't observe this behavior via the OpenRouter API. Thanks for testing!
This issue was closed because it has been inactive for 14 days since being marked as stale.
rip
Does this mean that MiniMax-Text-01 will never come to llama.cpp?
@ehartford maybe someone needs to open a new issue
Hi, how is the progress on this model going? I noticed there seemed to be some issues earlier, and I tried to reproduce them on my side. When I deployed the model without any quantization or precision reduction, it produced the expected results.
Is there anything I can help with? I’d like to contribute by integrating this model into llama.cpp. From what I understand, quantization currently leads to incorrect outputs—is that right?
I will continue this work in PR #13889.