
RAM Usage Kept Increasing

haok1402 opened this issue 1 month ago · 1 comment

Describe the bug

During pre-training, each Python worker's RAM usage increases continuously over time. With a fixed model and batch size, per-worker RES grows from ~31 GB to ~97 GB over several hours, and total system memory usage rises from ~350 GB to ~940 GB. Eventually the job crashes with an out-of-memory kill (exit code 9).
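To quantify the growth, a minimal sampler like the following can log per-worker RSS at intervals (a sketch, assuming a Linux host; the `pretrain` pattern below is a placeholder for your actual launcher command line):

```shell
# log_rss PATTERN — print timestamp, PID, and resident set size (MiB) for
# every process whose command line matches PATTERN. Run it in a loop to
# build a time series of per-worker RSS.
log_rss() {
    for pid in $(pgrep -f "$1"); do
        rss_kb=$(ps -o rss= -p "$pid" | tr -d ' ')
        [ -n "$rss_kb" ] && printf '%s pid=%s rss_mib=%d\n' \
            "$(date +%s)" "$pid" $((rss_kb / 1024))
    done
}

# Example: sample every 60 s (replace "pretrain" with your launcher pattern).
# while true; do log_rss "pretrain" >> rss.log; sleep 60; done
```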

Steps/Code to reproduce bug

Here are the main configs that I used.

declare -A MODEL_CONFIG
MODEL_CONFIG[tokenizer-type]="HuggingFaceTokenizer"
MODEL_CONFIG[tokenizer-model]="Qwen/Qwen3-0.6B"
MODEL_CONFIG[vocab-size]=151936
MODEL_CONFIG[position-embedding-type]="rope"
MODEL_CONFIG[rotary-base]=1000000
MODEL_CONFIG[max-position-embeddings]=40960
MODEL_CONFIG[num-layers]=28
MODEL_CONFIG[hidden-size]=1024
MODEL_CONFIG[ffn-hidden-size]=3072
MODEL_CONFIG[hidden-dropout]=0.0
MODEL_CONFIG[disable-bias-linear]=true
MODEL_CONFIG[swiglu]=true
MODEL_CONFIG[num-attention-heads]=16
MODEL_CONFIG[kv-channels]=128
MODEL_CONFIG[attention-dropout]=0.0
MODEL_CONFIG[qk-layernorm]=true
MODEL_CONFIG[init-method-std]=0.02
MODEL_CONFIG[normalization]="RMSNorm"
MODEL_CONFIG[norm-epsilon]=1e-6

declare -A TRAIN_CONFIG
TRAIN_CONFIG[train-iters]=3000
TRAIN_CONFIG[lr]=3e-4
TRAIN_CONFIG[min-lr]=3e-5
TRAIN_CONFIG[lr-warmup-iters]=150
TRAIN_CONFIG[lr-decay-iters]=2850
TRAIN_CONFIG[lr-decay-style]="cosine"
TRAIN_CONFIG[optimizer]="adam"
TRAIN_CONFIG[micro-batch-size]=1
TRAIN_CONFIG[global-batch-size]=1024
TRAIN_CONFIG[seq-length]=4096
TRAIN_CONFIG[bf16]=true
TRAIN_CONFIG[use-distributed-optimizer]=true
TRAIN_CONFIG[log-interval]=5
TRAIN_CONFIG[save-interval]=150
TRAIN_CONFIG[eval-interval]=150
TRAIN_CONFIG[eval-iters]=10

For the dataset-related arguments,

data_args_path=$(mktemp)
# Feed the loop via process substitution rather than a pipe, so that
# "exit 1" aborts the script instead of only the pipeline's subshell.
while read -r idx_file; do
    bin_file="${idx_file%.idx}.bin"
    if [ ! -f "$bin_file" ]; then
        echo "Missing bin_file for: $idx_file" >&2
        exit 1
    fi
    printf "1.0 %s " "${idx_file%.idx}" >> "$data_args_path"
done < <(find "$WORKSPACE/datasets/DCLM-baseline/tokenized/$BASE_MODEL" -type f -name "*.idx" | sort)
PRETRAIN_ARGS+=(--data-args-path $data_args_path)
PRETRAIN_ARGS+=(--split 969,30,1)
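For reference, the loop above emits a single line of whitespace-separated weight/prefix pairs (`1.0 /path/shard-00 1.0 /path/shard-01 …`). A quick sanity check on the generated file (a sketch; `check_data_args` is not a Megatron helper) is:

```shell
# check_data_args FILE — verify the data-args file holds an even number of
# whitespace-separated tokens, i.e. complete (weight, prefix) pairs.
check_data_args() {
    n=$(wc -w < "$1")
    if [ $((n % 2)) -ne 0 ]; then
        echo "Malformed data args: odd token count ($n)" >&2
        return 1
    fi
    echo "OK: $((n / 2)) weighted dataset prefixes"
}
```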

Expected behavior

Host RAM usage should not increase indefinitely during steady-state training.
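When diagnosing this kind of growth it helps to separate anonymous memory (a genuine leak) from page cache, since reads of the memory-mapped `.bin` shards will inflate cached/shared memory over time; that portion is reclaimable and not a leak. A minimal Linux-only sketch that samples the relevant `/proc/meminfo` fields:

```shell
# sample_meminfo — print the fields that distinguish a real leak (AnonPages
# climbing) from harmless page-cache growth (Cached climbing while
# MemAvailable stays roughly flat).
sample_meminfo() {
    awk '/^(MemAvailable|AnonPages|Cached):/ { printf "%s %d MiB\n", $1, $2 / 1024 }' /proc/meminfo
}
```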

Additional context

(Five screenshots attached.)

haok1402 · Nov 25 '25 20:11

+1

zhujian19891203 · Nov 26 '25 08:11

Hi @dimapihtar, can you help take a look? Thanks

BoxiangW · Dec 10 '25 22:12