
[QUESTION] MoE training shows abnormal gradient norm and loss

Open bugm opened this issue 9 months ago • 1 comments

Hello, I am training an MoE model (16B total parameters, 2.5B activated). Below are some TensorBoard logs:

[Figure: grad norm]

[Figure: lm loss]

[Figure: load_balance_loss]

As you can see, the loss and gradient norm become abnormal as training goes on. Here are some key arguments.

General arguments:

--lr 4.2e-4
--min-lr 4.2e-5
--lr-decay-style cosine
--weight-decay 0.1
--adam-beta1 0.9
--adam-beta2 0.95
--clip-grad 1.0
--init-method-std 0.006
--attention-dropout 0.0
--hidden-dropout 0.0
--lr-decay-iters 381469
--lr-warmup-iters 2000
--train-iters 381469
--micro-batch-size 2
--global-batch-size 4800
--num-layers 28
--hidden-size 2048
--num-attention-heads 16
--ffn-hidden-size 10944
--seq-length 4096
--max-position-embeddings 4096
--tensor-model-parallel-size 1
--pipeline-model-parallel-size 2
--context-parallel-size 1
--swiglu
--normalization RMSNorm
--norm-epsilon 1e-6
--use-rotary-position-embeddings
--no-bias-swiglu-fusion
--no-rope-fusion
--position-embedding-type rope
--untie-embeddings-and-output-weights
--rotary-base 10000
--rotary-scaling-factor 40
--kv-channels 128
--bf16
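
For readers who want to line up the curves with the schedule, here is a rough sketch of what the learning-rate and batch flags above imply. It assumes the common shape of linear warmup from zero followed by cosine decay down to the minimum rate. The function name `lr_at` and the constants are illustrative only, and the exact behavior of the training framework may differ in details.

```python
import math

# Values copied from the flags listed above.
MAX_LR = 4.2e-4          # --lr
MIN_LR = 4.2e-5          # --min-lr
WARMUP_ITERS = 2000      # --lr-warmup-iters
DECAY_ITERS = 381469     # --lr-decay-iters
TRAIN_ITERS = 381469     # --train-iters
GLOBAL_BATCH_SIZE = 4800 # --global-batch-size
SEQ_LENGTH = 4096        # --seq-length


def lr_at(step: int) -> float:
    """Assumed schedule: linear warmup from 0, then cosine decay to MIN_LR."""
    if step <= WARMUP_ITERS:
        return MAX_LR * step / WARMUP_ITERS
    if step >= DECAY_ITERS:
        return MIN_LR
    progress = (step - WARMUP_ITERS) / (DECAY_ITERS - WARMUP_ITERS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + cosine * (MAX_LR - MIN_LR)


# Token budget implied by the batch and sequence settings.
tokens_per_iter = GLOBAL_BATCH_SIZE * SEQ_LENGTH
total_tokens = tokens_per_iter * TRAIN_ITERS

if __name__ == "__main__":
    for step in (0, 1000, 2000, 100_000, 381_469):
        print(f"step {step:>7}: lr = {lr_at(step):.3e}")
    print(f"tokens per iteration:  {tokens_per_iter:,}")
    print(f"total training tokens: {total_tokens:,}")
```

Under these assumptions each iteration consumes about 19.7M tokens and the full run covers roughly 7.5T tokens, so the peak learning rate is reached very early relative to the length of the run.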

MoE-related arguments:

 --qk-layernorm
 --multi-latent-attention
 --transformer-impl transformer_engine
 --use-distributed-optimizer
 --attention-backend flash
 --moe-ffn-hidden-size 1408
 --moe-router-topk 6
 --num-experts 64
 --moe-layer-freq 1
 --moe-first-k-dense-replace 1
 --moe-aux-loss-coeff 0.001
 --moe-shared-expert-intermediate-size 2816
 --expert-model-parallel-size 8
 --kv-lora-rank 512
 --qk-head-dim 128
 --qk-pos-emb-head-dim 64
 --v-head-dim 128
 --moe-token-dispatcher-type alltoall_seq
 --moe-grouped-gemm
 --moe-router-score-function sigmoid
 --moe-router-enable-expert-bias
 --moe-router-bias-update-rate 0.001
 --moe-router-load-balancing-type seq_aux_loss
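
As a sanity check on the "16B total and 2.5B activated" figure, the MLP parameter counts implied by these flags can be estimated with a few lines of arithmetic. This is only a sketch under stated assumptions: every MLP is a bias-free SwiGLU block with three weight matrices, `--moe-first-k-dense-replace 1` is read as "the first layer is dense, the remaining 27 are MoE", and attention, router, norm, and embedding weights are ignored. The helper `swiglu_mlp_params` is a made-up name for illustration.

```python
# Values copied from the flags listed above.
HIDDEN = 2048               # --hidden-size
FFN_HIDDEN = 10944          # --ffn-hidden-size (dense layer)
MOE_FFN_HIDDEN = 1408       # --moe-ffn-hidden-size (per routed expert)
SHARED_INTERMEDIATE = 2816  # --moe-shared-expert-intermediate-size
NUM_EXPERTS = 64            # --num-experts
TOPK = 6                    # --moe-router-topk
NUM_LAYERS = 28             # --num-layers
FIRST_K_DENSE = 1           # --moe-first-k-dense-replace (assumed meaning)


def swiglu_mlp_params(hidden: int, intermediate: int) -> int:
    """Gate, up, and down projections of a SwiGLU MLP, assuming no biases."""
    return 3 * hidden * intermediate


moe_layers = NUM_LAYERS - FIRST_K_DENSE
expert = swiglu_mlp_params(HIDDEN, MOE_FFN_HIDDEN)
shared = swiglu_mlp_params(HIDDEN, SHARED_INTERMEDIATE)
dense = swiglu_mlp_params(HIDDEN, FFN_HIDDEN)

# All MLP weights stored in the model.
total_mlp = FIRST_K_DENSE * dense + moe_layers * (NUM_EXPERTS * expert + shared)
# MLP weights touched per token: top-k routed experts plus the shared expert.
active_mlp = FIRST_K_DENSE * dense + moe_layers * (TOPK * expert + shared)

if __name__ == "__main__":
    print(f"params per routed expert: {expert:,}")
    print(f"total MLP params:         {total_mlp:,}")
    print(f"active MLP params/token:  {active_mlp:,}")
```

With these assumptions the MLP weights alone come to about 15.5B in total and about 1.9B active per token, which is in the same range as the quoted totals once the ignored attention and embedding weights are added back.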

Can anyone suggest potential reasons for the abnormal gradient? Thanks.

bugm, Mar 12 '25 10:03

Marking as stale. No activity in 60 days.

github-actions[bot], May 11 '25 18:05

Try setting --init-method-std to 0.02?

lk137095576, Jul 08 '25 09:07