
Why do I get a large number of "The grad norm is nan. Skipping updating the model." warnings in every epoch during fine-tuning?

Open qingmuhe opened this issue 2 months ago • 0 comments

What is your question?

Why do I get a large number of "The grad norm is nan. Skipping updating the model." warnings in every epoch during fine-tuning?

Log excerpt:

[2025-10-29 23:59:01,773][root][WARNING] - The grad norm is nan. Skipping updating the model.
[2025-10-29 23:59:01,779][root][INFO] - train, rank: 0, epoch: 0/100, data_slice: 0/1, step_in_slice: 3/44, step_in_epoch: 3, total step: 3, (loss_avg_rank: 24.221), (loss_avg_slice: 27.633), (ppl_avg_slice: 1.002e+12), (acc_avg_slice: 0.000), (lr: 3.000e-07), [('loss_ctc', 24.181), ('loss_rich', 0.041), ('loss', 24.221), ('acc_rich', 1.0)], {'data_load': '1.313', 'forward_time': '0.201', 'backward_time': '0.158', 'optim_time': '0.111', 'total_time': '1.792'}, GPU, memory: usage: 0.918 GB, peak: 5.583 GB, cache: 6.035 GB, cache_peak: 6.035 GB
[2025-10-29 23:59:03,465][root][WARNING] - The grad norm is nan. Skipping updating the model.
[2025-10-29 23:59:03,471][root][INFO] - train, rank: 0, epoch: 0/100, data_slice: 0/1, step_in_slice: 4/44, step_in_epoch: 4, total step: 4, (loss_avg_rank: 24.631), (loss_avg_slice: 26.882), (ppl_avg_slice: 4.730e+11), (acc_avg_slice: 0.000), (lr: 3.000e-07), [('loss_ctc', 24.595), ('loss_rich', 0.035), ('loss', 24.631), ('acc_rich', 1.0)], {'data_load': '1.220', 'forward_time': '0.197', 'backward_time': '0.155', 'optim_time': '0.112', 'total_time': '1.692'}, GPU, memory: usage: 0.918 GB, peak: 5.583 GB, cache: 6.037 GB, cache_peak: 6.037 GB
[2025-10-29 23:59:05,070][root][WARNING] - The grad norm is nan. Skipping updating the model.
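
For context, a quick way to locate the source of the NaN is to probe every parameter's gradient right after backward() on a single step. Below is a minimal sketch, assuming a standalone reproduction of one training step; model, batch, and compute_loss are hypothetical placeholders, not FunASR APIs.

import torch

def debug_step(model, batch, compute_loss):
    # Anomaly mode raises a stack trace at the op that first produced
    # NaN/Inf (slow; debug only).
    with torch.autograd.set_detect_anomaly(True):
        loss = compute_loss(model, batch)  # hypothetical loss helper
        model.zero_grad(set_to_none=True)
        loss.backward()
    # Report every parameter whose gradient contains NaN or Inf.
    for name, p in model.named_parameters():
        if p.grad is not None and not torch.isfinite(p.grad).all():
            print(f"non-finite grad in: {name}")

Since loss_ctc dominates the loss in the log above, note that torch.nn.CTCLoss returns inf for any sample whose label sequence is longer than the (downsampled) input, and an inf loss yields NaN gradients unless zero_infinity=True is set.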

What have you tried?

My training script:

# Copyright FunASR (https://github.com/alibaba-damo-academy/FunASR). All Rights Reserved.
#  MIT License  (https://opensource.org/licenses/MIT)

workspace=`pwd`

# which GPU(s) to use for training or fine-tuning
# export CUDA_VISIBLE_DEVICES="0,1"
export CUDA_VISIBLE_DEVICES="0"
# gpu_num=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}')
gpu_num=1

# model_name from model_hub, or model_dir in local path

## option 1, download model automatically
model_name_or_model_dir="iic/SenseVoiceSmall"

## option 2, download model by git
#local_path_root=${workspace}/modelscope_models
#mkdir -p ${local_path_root}/${model_name_or_model_dir}
#git clone https://www.modelscope.cn/${model_name_or_model_dir}.git ${local_path_root}/${model_name_or_model_dir}
#model_name_or_model_dir=${local_path_root}/${model_name_or_model_dir}


# data files: training and validation manifests in jsonl format
train_data=${workspace}/data/scpllm_train.jsonl
val_data=${workspace}/data/scpllm_val.jsonl

# exp output dir
output_dir="./outputs_hotword"
log_file="${output_dir}/log.txt"

deepspeed_config=${workspace}/deepspeed_conf/ds_stage1.json

mkdir -p ${output_dir}
echo "log_file: ${log_file}"

DISTRIBUTED_ARGS="
    --nnodes ${WORLD_SIZE:-1} \
    --nproc_per_node $gpu_num \
    --node_rank ${RANK:-0} \
    --master_addr ${MASTER_ADDR:-127.0.0.1} \
    --master_port ${MASTER_PORT:-26669}
"

echo $DISTRIBUTED_ARGS

# funasr trainer path
if [ -f `dirname $(which funasr)`/train_ds.py ]; then
    train_tool=`dirname $(which funasr)`/train_ds.py
elif [ -f `dirname $(which funasr)`/../lib/python*/site-packages/funasr/bin/train_ds.py ]; then
    train_tool=`dirname $(which funasr)`/../lib/python*/site-packages/funasr/bin/train_ds.py
else
    echo "Error: train_ds.py not found in funasr bin directory."
    train_tool=/home/ma-user/work/FunASR/funasr/bin/train_ds.py
    # exit 1
fi
ABSOLUTE_PATH=$(cd $(dirname $train_tool); pwd)
train_tool=${ABSOLUTE_PATH}/train_ds.py
echo "Using funasr trainer: ${train_tool}"

torchrun $DISTRIBUTED_ARGS \
${train_tool} \
++model="${model_name_or_model_dir}" \
++trust_remote_code=true \
++train_data_set_list="${train_data}" \
++valid_data_set_list="${val_data}" \
++dataset_conf.data_split_num=1 \
++dataset_conf.batch_sampler="BatchSampler" \
++dataset_conf.batch_size=9000  \
++dataset_conf.sort_size=1024 \
++dataset_conf.batch_type="token" \
++dataset_conf.num_workers=0 \
++train_conf.max_epoch=100 \
++train_conf.log_interval=1 \
++train_conf.resume=true \
++train_conf.validate_interval=2000 \
++train_conf.save_checkpoint_interval=2000 \
++train_conf.keep_nbest_models=3 \
++train_conf.avg_nbest_model=3 \
++train_conf.use_deepspeed=false \
++train_conf.use_fp16=false \
++optim=adamw \
++optim_conf.betas=[0.9,0.98] \
++optim_conf.weight_decay=0 \
++optim_conf.eps=1e-9 \
++train_conf.grad_clip=0.5 \
++scheduler_conf.warmup_steps=1000 \
++train_conf.deepspeed_config=${deepspeed_config} \
++optim_conf.lr=0.0003 \
++output_dir="${output_dir}" &> ${log_file}



# ++optim_conf.lr=0.0002 \
# ++dataset_conf.max_token_length=100 \
# ++train_conf.lr_scheduler="cosine" \
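
Related to the CTC point above: before touching the learning rate, it may be worth scanning the jsonl manifests for samples whose label length exceeds the number of encoder output frames, since those alone can drive loss_ctc to inf/NaN. A rough sketch; the field names source_len/target_len and the subsampling factor of 6 are assumptions about the manifest format, not confirmed FunASR fields.

import json

SUBSAMPLING = 6  # assumed encoder downsampling factor; adjust to the real model

def find_suspect_samples(path):
    suspects = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)
            # source_len / target_len are assumed manifest fields.
            frames = item.get("source_len", 0) // SUBSAMPLING
            if item.get("target_len", 0) > frames:
                suspects.append(item.get("key"))
    return suspects

print(find_suspect_samples("data/scpllm_train.jsonl"))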

What's your environment?

  • OS (e.g., Linux): Linux, Ascend 1*ascend-snt9b1 | ARM, 24 cores, 192 GB RAM
  • FunASR Version (e.g., 1.0.0): 1.2.7
  • ModelScope Version (e.g., 1.11.0): 1.25.0
  • PyTorch Version (e.g., 2.0.0): 2.3.1
  • How you installed funasr (pip, source): source
  • Python version: Python 3.10.0
  • GPU (e.g., V100M32): ascend-snt9b1 NPU
  • CUDA/cuDNN version (e.g., cuda11.7):
  • Docker version (e.g., funasr-runtime-sdk-cpu-0.4.1):
  • Any other relevant information:
