[QUESTION] MoE training hits abnormal gradient norm and loss
Hello, I am training an MoE model (16B total parameters, 2.5B activated), and below are some TensorBoard logs:
[TensorBoard plots: grad norm, lm loss, load_balance_loss]
As you can see, as training goes on, the loss and gradient norm become abnormal; a sketch for pulling the exact spike iteration out of the event files is shown below.
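Not part of the original report, but for anyone debugging something similar, here is a minimal sketch for locating the spike, assuming standard TensorBoard event files; the log directory path and the "grad-norm" scalar tag are placeholders, so check ea.Tags() for what your run actually logged.

```python
# Rough sketch (not Megatron-LM code): find the iterations where grad norm spikes.
import statistics

from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

ea = EventAccumulator("path/to/your/tensorboard/logdir")  # hypothetical path
ea.Reload()
print(ea.Tags()["scalars"])          # list the scalar tags actually logged

events = ea.Scalars("grad-norm")     # tag name is an assumption; adjust to your run
steps = [e.step for e in events]
values = [e.value for e in events]

# Flag iterations whose grad norm is far above the median of the whole curve.
med = statistics.median(values)
spikes = [(s, v) for s, v in zip(steps, values) if v > 10 * med]
print(spikes[:20])
```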
Here are some key arguments. The ordinary arguments (a sketch of the resulting LR schedule follows this list):
--lr 4.2e-4
--min-lr 4.2e-5
--lr-decay-style cosine
--weight-decay 0.1
--adam-beta1 0.9
--adam-beta2 0.95
--clip-grad 1.0
--init-method-std 0.006
--attention-dropout 0.0
--hidden-dropout 0.0
--lr-decay-iters 381469
--lr-warmup-iters 2000
--train-iters 381469
--micro-batch-size 2
--global-batch-size 4800
--num-layers 28
--hidden-size 2048
--num-attention-heads 16
--ffn-hidden-size 10944
--seq-length 4096
--max-position-embeddings 4096
--tensor-model-parallel-size 1
--pipeline-model-parallel-size 2
--context-parallel-size 1
--swiglu
--normalization RMSNorm
--norm-epsilon 1e-6
--use-rotary-position-embeddings
--no-bias-swiglu-fusion
--no-rope-fusion
--position-embedding-type rope
--untie-embeddings-and-output-weights
--rotary-base 10000
--rotary-scaling-factor 40
--kv-channels 128
--bf16
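For reference (my addition, not from the original post), here is a minimal sketch of the warmup-plus-cosine schedule the LR flags above describe. It may differ in small details from Megatron-LM's actual OptimizerParamScheduler, but it is handy for checking roughly what learning rate the run was at when the spike appeared; the constants are copied from the arguments above.

```python
# Sketch of linear warmup followed by cosine decay, using the posted flag values.
import math

MAX_LR, MIN_LR = 4.2e-4, 4.2e-5
WARMUP_ITERS, DECAY_ITERS = 2000, 381469

def lr_at(it: int) -> float:
    if it < WARMUP_ITERS:
        # linear warmup from 0 to MAX_LR
        return MAX_LR * it / WARMUP_ITERS
    # cosine decay from MAX_LR down to MIN_LR over the remaining iterations
    progress = min((it - WARMUP_ITERS) / (DECAY_ITERS - WARMUP_ITERS), 1.0)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + coeff * (MAX_LR - MIN_LR)

print(lr_at(2000), lr_at(50000), lr_at(381469))
```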
Arguments related to MoE (a sketch of the router behavior these flags configure follows this list):
--qk-layernorm
--multi-latent-attention
--transformer-impl transformer_engine
--use-distributed-optimizer
--attention-backend flash
--moe-ffn-hidden-size 1408
--moe-router-topk 6
--num-experts 64
--moe-layer-freq 1
--moe-first-k-dense-replace 1
--moe-aux-loss-coeff 0.001
--moe-shared-expert-intermediate-size 2816
--expert-model-parallel-size 8
--kv-lora-rank 512
--qk-head-dim 128
--qk-pos-emb-head-dim 64
--v-head-dim 128
--moe-token-dispatcher-type alltoall_seq
--moe-grouped-gemm
--moe-router-score-function sigmoid
--moe-router-enable-expert-bias
--moe-router-bias-update-rate 0.001
--moe-router-load-balancing-type seq_aux_loss
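Since the router flags above are the less familiar part of the config, here is a rough sketch (my addition, not Megatron-LM's implementation) of what sigmoid scoring with an expert bias amounts to, in the DeepSeek-V3-style aux-loss-free balancing scheme these options correspond to. The real TopKRouter differs in details, and the shapes and normalization here are simplified.

```python
# Rough sketch of bias-adjusted sigmoid top-k routing; constants match the flags above.
import torch

NUM_EXPERTS, TOPK, BIAS_UPDATE_RATE = 64, 6, 1e-3

def route(logits: torch.Tensor, expert_bias: torch.Tensor):
    """logits: [num_tokens, NUM_EXPERTS]; expert_bias: [NUM_EXPERTS]."""
    scores = torch.sigmoid(logits)
    # The bias only influences *which* experts are selected, not the combine weights.
    _, topk_idx = torch.topk(scores + expert_bias, TOPK, dim=-1)
    topk_scores = torch.gather(scores, 1, topk_idx)
    probs = topk_scores / topk_scores.sum(dim=-1, keepdim=True)  # combine weights
    return topk_idx, probs

def update_bias(expert_bias: torch.Tensor, topk_idx: torch.Tensor):
    # Aux-loss-free balancing: nudge the bias up for under-loaded experts and
    # down for over-loaded ones, by +/- the bias update rate each step.
    load = torch.bincount(topk_idx.flatten(), minlength=NUM_EXPERTS).float()
    return expert_bias + BIAS_UPDATE_RATE * torch.sign(load.mean() - load)

logits = torch.randn(8, NUM_EXPERTS)   # stand-in router logits for 8 tokens
bias = torch.zeros(NUM_EXPERTS)
idx, probs = route(logits, bias)
bias = update_bias(bias, idx)
```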
Can anyone offer some potential reasons for the abnormal gradient? Thanks.
Try setting --init-method-std to 0.02?
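To make that suggestion concrete (my sketch, not Megatron's exact code): --init-method-std is the standard deviation of the normal initializer applied to most weight matrices, and in the versions I have looked at the output projections additionally get this value scaled down by 1/sqrt(2 * num_layers). 0.02 is the common GPT-style default, whereas 0.006 is on the small side.

```python
# Minimal sketch of what --init-method-std controls; details may differ by Megatron version.
import math

import torch

def init_method_normal(sigma: float):
    def init_(tensor: torch.Tensor):
        return torch.nn.init.normal_(tensor, mean=0.0, std=sigma)
    return init_

def scaled_init_method_normal(sigma: float, num_layers: int):
    # used for output projections in residual branches
    std = sigma / math.sqrt(2.0 * num_layers)
    def init_(tensor: torch.Tensor):
        return torch.nn.init.normal_(tensor, mean=0.0, std=std)
    return init_

w = torch.empty(2048, 2048)
init_method_normal(0.006)(w)   # value from the original run
init_method_normal(0.02)(w)    # suggested value
```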