FastChat
multiprocessing train error
```
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3765 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3766 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 2 (pid: 3767) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in
```
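An exit code of -7 from torch.distributed.elastic means the worker was killed by signal 7 (SIGBUS). When training runs inside a container, a common trigger is running out of shared memory, since PyTorch dataloader workers and inter-process tensor passing go through `/dev/shm`. A quick diagnostic (assuming a Linux host; this command is not from the original report):

```shell
# Show how much shared memory is mounted. PyTorch worker processes
# exchange tensors via /dev/shm, and exhausting it raises SIGBUS,
# which torchrun reports as exitcode -7.
df -h /dev/shm
```

If this reports the Docker default of 64M, relaunching the container with a larger `--shm-size` (for example `docker run --shm-size=16g ...`) or with `--ipc=host` is a common fix.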
Training run args:
```shell
torchrun --nnodes=1 --nproc_per_node=3 --master_port=20001 train/train.py \
    --model_name_or_path /data/model/vicuna/vicuna-7b \
    --data_path playground/data/dummy.json \
    --bf16 True \
    --output_dir /data/app/output \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1200 \
    --save_total_limit 10 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --logging_dir "/data/app/output" \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess False
```
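For context, these flags imply an effective global batch size of 96: 4 per device, times 3 GPUs, times 8 gradient-accumulation steps (a sanity check added here, not part of the original post):

```shell
# Effective global batch size implied by the torchrun flags:
# per_device_train_batch_size * nproc_per_node * gradient_accumulation_steps
per_device=4
gpus=3
accum=8
echo $((per_device * gpus * accum))   # prints 96
```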
PyTorch version: 2.0.0+cu118, CUDA version: 11.8, cuDNN version: 8700, CUDA_HOME: /usr/local/cuda, available GPUs: 3
Same error here. Did you manage to solve it?