xh

5 comments by xh

OP, my main.py is unchanged; here is train_chat.sh:

```shell
PRE_SEQ_LEN=128
LR=1e-2
NUM_GPUS=6
CHAT_TRAIN_DATA=/data/lxh/workspace/nlp/ft-dataset/lxh_v3/sft_v3_lxh_shuffle.json
CHAT_VAL_DATA=/data/lxh/workspace/nlp/ft-dataset/lxh_v3/sft_v3_lxh_shuffle.json
CHECKPOINT_NAME=output/chatglm-6b2-pt-lxh-v3-$PRE_SEQ_LEN-$LR

#torchrun --standalone --nnodes=1 --nproc-per-node=$NUM_GPUS main.py \
CUDA_VISIBLE_DEVICES=1,2,3,4,5,6 python3 main.py \
    --do_train \
    --train_file $CHAT_TRAIN_DATA \
    --validation_file $CHAT_VAL_DATA \
    --preprocessing_num_workers 10 \
    --prompt_column instruction...
```

> > > Same environment, same config, same problem. After doing the above, a new issue appeared.
>
> ```
> C:\ProgramData\anaconda3\envs\glm\lib\site-packages\torch\distributed\distributed_c10d.py:707 in _get_default_group
>   704 │     Getting the default process group created by init_process_group
>   705 │     """
>   706 │     if not...
> ```
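The truncated traceback ends inside torch.distributed's `_get_default_group`, which typically raises "Default process group has not been initialized" when a distributed call runs before `init_process_group` (e.g. when the script is launched with plain `python3` instead of `torchrun`, so no launcher sets up the process group). A minimal single-process sketch of what the launcher normally arranges, assuming PyTorch is installed and using the CPU-only `gloo` backend; the address and port values here are illustrative defaults, not taken from the issue:

```python
import os
import torch.distributed as dist

# torchrun normally exports MASTER_ADDR/MASTER_PORT (plus RANK and
# WORLD_SIZE) for every worker; set them manually for this one-process demo.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# Create the default process group; any dist.* collective called before
# this point would fail inside _get_default_group, as in the traceback.
dist.init_process_group(backend="gloo", rank=0, world_size=1)
print(dist.is_initialized())  # True once the default group exists

dist.destroy_process_group()
```

If the training script is expected to run distributed, launching it via the commented-out `torchrun` line (rather than `python3` directly) is what would perform this initialization for each of the `NUM_GPUS` workers.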