Training instability with InfLLM v2 on MiniCPM4-8B using sparse attention
I'm encountering training instability when fine-tuning MiniCPM4-8B with InfLLM v2 using the provided sparse configuration. Training collapses immediately: the gradient norm is NaN at the first optimization step, and the loss drops to zero at the second.
Environment:
- Model: MiniCPM4-8B
- Transformers: 4.53.0
- PyTorch: 2.5.0
- CUDA: 12.6
- Attention: InfLLM v2
- Hardware: 8× H200
- DeepSpeed: ZeRO Stage 3
Configuration:

```json
"sparse_config": {
  "kernel_size": 32,
  "kernel_stride": 16,
  "init_blocks": 1,
  "block_size": 64,
  "window_size": 2048,
  "topk": 64,
  "use_nope": false,
  "dense_len": 8192
}
```
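As a debugging aid, the config values can be sanity-checked in isolation. The divisibility constraints below are assumptions typical of block-sparse attention kernels (window and dense prefix tiling evenly into blocks), not documented requirements of InfLLM v2:

```python
# Hypothetical sanity check for the sparse_config above; the divisibility
# constraints are assumptions, not confirmed requirements of InfLLM v2.
sparse_config = {
    "kernel_size": 32,
    "kernel_stride": 16,
    "init_blocks": 1,
    "block_size": 64,
    "window_size": 2048,
    "topk": 64,
    "use_nope": False,
    "dense_len": 8192,
}

def check_sparse_config(cfg):
    # The local window should tile evenly into attention blocks.
    assert cfg["window_size"] % cfg["block_size"] == 0, "window must tile into blocks"
    # The dense prefix length should also be a multiple of the block size.
    assert cfg["dense_len"] % cfg["block_size"] == 0, "dense prefix must tile into blocks"
    # Overlapping kernels are usually stepped by a stride that divides the kernel size.
    assert cfg["kernel_size"] % cfg["kernel_stride"] == 0, "stride should divide kernel"
    return True

print(check_sparse_config(sparse_config))  # -> True
```

All three checks pass for the configuration above, so the instability is unlikely to be a simple tiling mismatch.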
- Dataset: slimpajama-per-source-length-upsample
- Learning rate: 2e-5
- Warmup: 20 steps (linear)
- Weight decay: 0
- Batch size: 1 × 8 × 8 (per GPU × num GPUs × accumulation steps)
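For clarity, the batch-size line above works out to an effective batch of 64 sequences per optimizer step:

```python
# Effective batch size per optimizer step, from the hyperparameters above.
per_gpu, num_gpus, accum_steps = 1, 8, 8
effective_batch = per_gpu * num_gpus * accum_steps
print(effective_batch)  # -> 64
```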
Issue:
- The first training step immediately produces a NaN gradient norm
- The second step shows the loss dropping to 0
- Training collapses completely
Note: Full attention fine-tuning works stably with the same configuration
Expected Behavior: Stable training similar to full attention mode.
Reproduction Steps:
1. Initialize MiniCPM4-8B with the InfLLM v2 sparse config
2. Prepare the slimpajama dataset
3. Start training with the hyperparameters above
4. Observe immediate NaN in the gradients
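To localize where the NaN first appears, one option is to scan per-parameter gradient norms after the first backward pass and report the first non-finite entry. This is a framework-agnostic sketch; `named_grad_norms` is a hypothetical mapping from parameter names to gradient norms (e.g. collected from the model's named parameters before the optimizer step):

```python
import math

def first_nonfinite(named_grad_norms):
    """Return the name of the first parameter whose gradient norm is
    NaN or Inf, or None if every norm is finite."""
    for name, norm in named_grad_norms.items():
        if not math.isfinite(norm):
            return name
    return None

# Example with hypothetical parameter names: the sparse-attention
# projection produces a NaN gradient norm.
norms = {
    "embed_tokens.weight": 0.42,
    "layers.0.self_attn.q_proj.weight": float("nan"),
    "layers.0.mlp.gate_proj.weight": 0.17,
}
print(first_nonfinite(norms))  # -> layers.0.self_attn.q_proj.weight
```

If the first non-finite gradient consistently lands in the sparse-attention projections but not elsewhere, that would point at the InfLLM v2 kernel's backward pass rather than the data or optimizer.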
May I ask which training framework you are using? Did you train MiniCPM4 with any parallelism strategies, such as tensor parallelism or sequence parallelism? Also, we have open-sourced our training code with LLaMA-Factory; you might want to start from that.
@bolixinyu Have you succeeded in training InfLLM v2?
@suhmily10 Thanks for your reply; we will try it with LLaMA-Factory later.
@chaiyixuan We haven't had the chance to fix the code yet. Have you tried training it with LLaMA-Factory? I'd be interested to hear if it works for you.
This is not my responsibility, so kindly forward it to the software testing department as I specifically mentioned. You don't have the right to ask me these frivolous questions; kindly behave like a junior, as I am a senior engineer.
@saurabh12453 I was simply sharing context and asking a fellow community member if they had tried an alternative solution. This was a collaborative discussion, not an assignment of responsibility. Your response is unwarranted and unprofessional. Let’s keep the conversation respectful and constructive.