park1200656

Results 2 comments of park1200656

RuntimeError: NCCL communicator was aborted on rank 0. Original reason for failure was: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4428, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1805357 milliseconds before timing out....

안녕하세요, 실행 명령어는 아래와 같습니다. 감사합니다. torchrun --nproc_per_node=8 --master_port=34321 main.py \ --model_name='EleutherAI/polyglot-ko-1.3b' \ --train_file_path='data/text_ko_alpaca_data_small.jsonl' \ --num_train_epochs=1 \ --data_text_column='text' \ --block_size=512 \ --batch_size=2 \ --fp16=True \ --deepspeed=ds_zero3.json # --fsdp_transformer_layer_cls_to_wrap='GPTNeoXLayer' \