
multiprocessing train error

Open landerson85 opened this issue 2 years ago • 1 comments

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3765 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3766 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 2 (pid: 3767) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
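
A note on reading `exitcode: -7` in the log above: by the usual Python process convention, a negative exit code means the child was terminated by the signal with that number. The sketch below decodes such a code; `describe_exit_code` is a hypothetical helper written for illustration and is not part of torch. On Linux x86-64, signal 7 is SIGBUS, which is often (but not always) associated with exhausted shared memory, for example a small /dev/shm inside a container. That cause is an assumption and is not confirmed by anything in this thread.

```python
import signal


def describe_exit_code(exitcode: int) -> str:
    """Turn a child-process exit code into a readable label.

    Hypothetical helper for illustration; not part of torch.
    """
    if exitcode < 0:
        # Negative code -N: the process was terminated by signal N.
        try:
            return f"killed by {signal.Signals(-exitcode).name}"
        except ValueError:
            # Signal number not known on this platform.
            return f"killed by unknown signal {-exitcode}"
    return f"exited with status {exitcode}"


if __name__ == "__main__":
    # The signal name printed depends on the platform's signal numbering.
    print(describe_exit_code(-7))
```

If the decoded signal is SIGBUS, checking the free space on /dev/shm (for example with `df -h /dev/shm`) is one reasonable first step.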

Training run arguments:

torchrun --nnodes=1 --nproc_per_node=3 --master_port=20001 train/train.py \
    --model_name_or_path /data/model/vicuna/vicuna-7b \
    --data_path playground/data/dummy.json \
    --bf16 True \
    --output_dir /data/app/output \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1200 \
    --save_total_limit 10 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --logging_dir "/data/app/output" \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess False

landerson85, Apr 22 '23 08:04

PyTorch version: 2.0.0+cu118
CUDA version: 11.8
cuDNN version: 8700
CUDA HOME: /usr/local/cuda
Available GPUs: 3

landerson85, Apr 23 '23 02:04

Same error here. Did you manage to solve it?

nouf01, Jan 07 '24 05:01