Gao Shiyuan
@Weiyun1025 Hi, I'm running SFT fine-tuning with the script below on two A800 nodes, and training gets stuck in the language_model forward pass. Have you run into this?

```shell
set -x

export MASTER_PORT=34235
export TF_CPP_MIN_LOG_LEVEL=3
export USE_TCS_LOADER=0
export LAUNCHER=pytorch

# Set the task name
CURRENT_PATH=$(pwd)
PROJECT_NAME=internvl3_5_30b_sft
TASK_NAME=$(basename "$0")
TASK_NAME="${TASK_NAME%.*}"
echo "TASK_NAME: $TASK_NAME"
echo "PROJECT_NAME: $PROJECT_NAME"
...
```
> Roughly how long does it hang? This codebase has not been optimized for MoE, so training is indeed fairly slow; the 30B MoE runs at about the same speed as the 38B model. It may not actually be stuck.

@Weiyun1025 After 30 minutes it hits an NCCL timeout:

```
[rank3]:[E902 12:24:58.761850774 ProcessGroupNCCL.cpp:632] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=77419, OpType=_ALLGATHER_BASE, NumelIn=98304, NumelOut=1572864, Timeout(ms)=1800000) ran for 1800024 milliseconds before timing out.
[rank8]:[E902 12:24:58.757408621...
```
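For context on the log above: `Timeout(ms)=1800000` is the default 30-minute collective timeout that the `torch.distributed` NCCL watchdog enforces, so the "timeout after 30min" matches the default exactly rather than indicating a custom limit. If the run is merely slow rather than deadlocked, one knob to try (a sketch, not a confirmed fix for this hang) is raising that timeout when initializing the process group:

```python
from datetime import timedelta

# The watchdog reported Timeout(ms)=1800000, which is exactly the
# default 30-minute NCCL collective timeout.
reported_timeout = timedelta(milliseconds=1_800_000)
assert reported_timeout == timedelta(minutes=30)

# Sketch of raising the limit via torch.distributed (the 2-hour value
# is an assumption for illustration, not a recommended setting):
#
# import torch.distributed as dist
# dist.init_process_group(backend="nccl", timeout=timedelta(hours=2))
#
# Note that a genuine deadlock (e.g. ranks disagreeing on whether to
# enter an all-gather) will still time out, just later; a longer
# timeout only helps if the collective is slow but making progress.
```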