
Training instability with InfLLM v2 on MiniCPM4-8B using sparse attention

Open bolixinyu opened this issue 5 months ago • 6 comments

I'm encountering training instability when fine-tuning MiniCPM4-8B with InfLLM v2 using the provided sparse configuration. Training collapses immediately: the gradient norm is NaN at the first optimization step, and the loss drops to zero at the second.

Environment:

  • Model: MiniCPM4-8B
  • Transformers: 4.53.0
  • PyTorch: 2.5.0
  • CUDA: 12.6
  • Attention: InfLLM v2
  • Hardware: 8 × H200
  • DeepSpeed: Zero Stage 3

Configuration:

"sparse_config": {
    "kernel_size": 32,
    "kernel_stride": 16,
    "init_blocks": 1,
    "block_size": 64,
    "window_size": 2048,
    "topk": 64,
    "use_nope": false,
    "dense_len": 8192
}
  • Dataset: slimpajama-per-source-length-upsample
  • Learning rate: 2e-5
  • Warmup: 20 steps (linear)
  • Weight decay: 0
  • Batch size: 1 × 8 × 8 (per GPU × num GPUs × accumulation steps)
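
For reference, a minimal loading sketch. This assumes the OpenBMB remote modeling code picks `sparse_config` up from the model config; the attribute name simply mirrors the config above and is not a verified API:

import torch
from transformers import AutoConfig, AutoModelForCausalLM

MODEL_PATH = "openbmb/MiniCPM4-8B"  # adjust to your local checkpoint path

# Assumption: the custom modeling code reads `sparse_config` from the config.
config = AutoConfig.from_pretrained(MODEL_PATH, trust_remote_code=True)
config.sparse_config = {
    "kernel_size": 32,
    "kernel_stride": 16,
    "init_blocks": 1,
    "block_size": 64,
    "window_size": 2048,
    "topk": 64,
    "use_nope": False,
    "dense_len": 8192,
}

model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    config=config,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)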

Issue:

  • First training step immediately results in a NaN gradient norm
  • Second step shows the loss dropping to 0
  • Training collapses completely

Note: Full-attention fine-tuning works stably with the same configuration.
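
A one-batch smoke test makes that comparison concrete before launching the full DeepSpeed job. This is a hypothetical helper, not code from our setup; `model` is whatever the loading sketch above returns, and `batch` is a single tokenized sample from the dataset:

import torch

def smoke_test(model, batch):
    # One forward/backward pass on a single batch. With the sparse config this
    # should reproduce the failure described above; the full-attention model
    # should stay finite.
    model.train()
    out = model(input_ids=batch["input_ids"], labels=batch["input_ids"])
    out.loss.backward()
    grads = (p.grad for p in model.parameters() if p.grad is not None)
    finite = all(torch.isfinite(g).all() for g in grads)
    print(f"loss={out.loss.item():.4f}, gradients finite: {finite}")
    model.zero_grad(set_to_none=True)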

Expected Behavior: Stable training similar to full attention mode.

Reproduction Steps:

  1. Initialize MiniCPM4-8B with the InfLLM v2 sparse config
  2. Prepare the slimpajama dataset
  3. Start training with the hyperparameters above
  4. Observe immediate NaNs in the gradients (a hook for localizing the first non-finite gradient is sketched below)
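
To narrow down where the non-finite gradient first appears (step 4), one option is a set of full backward hooks. This is a hypothetical debugging aid, not part of our original run; note that module hooks can interact with DeepSpeed ZeRO-3, so it is simplest to run this on a single GPU without DeepSpeed:

import torch

def attach_nan_watch(model):
    # Print each module whose output gradient goes non-finite during backward,
    # to localize the NaN to a specific layer in the sparse-attention path.
    def make_hook(name):
        def hook(module, grad_input, grad_output):
            for g in grad_output:
                if g is not None and not torch.isfinite(g).all():
                    print(f"non-finite grad_output at: {name}")
                    break
        return hook
    for name, module in model.named_modules():
        if name:  # skip the root module itself
            module.register_full_backward_hook(make_hook(name))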

bolixinyu avatar Jul 04 '25 03:07 bolixinyu

May I ask which training framework you are using? Did you train MiniCPM4 with any parallel strategies such as tensor parallelism or sequence parallelism? Also, we open-sourced our training code with LLaMA-Factory; maybe you could start with that.

suhmily10 avatar Aug 04 '25 05:08 suhmily10

@bolixinyu Have you succeeded in training InfLLM v2?

chaiyixuan avatar Sep 02 '25 11:09 chaiyixuan

May I ask which training framework you are using? Did you train MiniCPM4 with any parallel strategies such as tensor parallelism or sequence parallelism? Also, we open-sourced our training code with LLaMA-Factory; maybe you could start with that.

@suhmily10 Thanks for your reply, we will try it with Llama-Factory later.

bolixinyu avatar Sep 02 '25 11:09 bolixinyu

@bolixinyu Have you succeeded in training InfLLM v2?

@chaiyixuan We haven't had the chance to fix the code yet. Have you tried training it with Llama-Factory? I'd be interested to hear if it works for you.

bolixinyu avatar Sep 02 '25 11:09 bolixinyu

This is not my responsibility so kindly forward it to software testers department specifically mentioned by me so i dont have right to ask me this follies question kindly behave like junior as i am senior engineer


saurabh12453 avatar Sep 03 '25 06:09 saurabh12453

This is not my responsibility so kindly forward it to software testers department specifically mentioned by me so i dont have right to ask me this follies question kindly behave like junior as i am senior engineer

@saurabh12453 I was simply sharing context and asking a fellow community member if they had tried an alternative solution. This was a collaborative discussion, not an assignment of responsibility. Your response is unwarranted and unprofessional. Let’s keep the conversation respectful and constructive.

bolixinyu avatar Sep 03 '25 06:09 bolixinyu