
multi-node training not working

Open ashmalvayani opened this issue 1 year ago • 0 comments

Hey. I haven't found any examples of using FastChat for multi-node training; my launch command is at the end of this post.

From this page, https://pytorch.org/docs/stable/elastic/run.html, I found that the `--rdzv-id`, `--rdzv-backend`, and `--rdzv-endpoint` parameters enable multi-node training by explicitly running the launch command on each node and changing `--node_rank`. However, I am getting the following error:

```
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.

Traceback (most recent call last):
  File "/home/ashmal.vayani/anaconda3/envs/finetune_mobillama/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/ashmal.vayani/anaconda3/envs/finetune_mobillama/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/ashmal.vayani/anaconda3/envs/finetune_mobillama/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/ashmal.vayani/anaconda3/envs/finetune_mobillama/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/ashmal.vayani/anaconda3/envs/finetune_mobillama/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ashmal.vayani/anaconda3/envs/finetune_mobillama/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 241, in launch_agent
    result = agent.run()
  File "/home/ashmal.vayani/anaconda3/envs/finetune_mobillama/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/home/ashmal.vayani/anaconda3/envs/finetune_mobillama/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 723, in run
    result = self._invoke_run(role)
  File "/home/ashmal.vayani/anaconda3/envs/finetune_mobillama/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 858, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "/home/ashmal.vayani/anaconda3/envs/finetune_mobillama/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/home/ashmal.vayani/anaconda3/envs/finetune_mobillama/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 692, in _initialize_workers
    self._rendezvous(worker_group)
  File "/home/ashmal.vayani/anaconda3/envs/finetune_mobillama/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/home/ashmal.vayani/anaconda3/envs/finetune_mobillama/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 546, in _rendezvous
    store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
  File "/home/ashmal.vayani/anaconda3/envs/finetune_mobillama/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1028, in next_rendezvous
    self._op_executor.run(join_op, deadline)
  File "/home/ashmal.vayani/anaconda3/envs/finetune_mobillama/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 638, in run
    raise RendezvousTimeoutError()
torch.distributed.elastic.rendezvous.api.RendezvousTimeoutError
```

This is the command I am running (on node 0):

```bash
torchrun --nnodes=2 --nproc_per_node=8 --node_rank=0 --rdzv-id=345 --rdzv-backend=c10d --rdzv-endpoint=16.1.32.184 --master_port=40001 fastchat/train/train.py \
    --deepspeed ds_config.json \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --data_path data/Data.json \
    --bf16 True \
    --output_dir ./outputs \
    --num_train_epochs 1 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 100 \
    --save_total_limit 3 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.04 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --report_to wandb \
    --run_name "Experiment" \
    --gradient_checkpointing True \
    --lazy_preprocess True
```
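For comparison, the multi-node example in the torchrun documentation linked above runs one identical command on every participating node, with the same `--rdzv-id`, `--rdzv-backend=c10d`, and `--rdzv-endpoint` values and the endpoint written as `host:port` of the rendezvous node. Below is a minimal sketch of that pattern applied to this setup; it is not a verified fix for this issue, and the port `29400` is an assumption (any free port on the rendezvous host would do):

```bash
# Sketch of the documented torchrun multi-node pattern, under the assumptions
# stated above: identical command on both nodes, rendezvous endpoint given as
# host:port (29400 assumed free on 16.1.32.184), no --master_addr/--master_port
# (not used with the c10d backend, per the warning above), and no --node_rank
# (the docs' c10d example omits it and lets the elastic rendezvous assign ranks).
torchrun --nnodes=2 --nproc_per_node=8 \
    --rdzv-id=345 \
    --rdzv-backend=c10d \
    --rdzv-endpoint=16.1.32.184:29400 \
    fastchat/train/train.py \
    --deepspeed ds_config.json \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --data_path data/Data.json \
    --bf16 True \
    --output_dir ./outputs \
    --num_train_epochs 1 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 100 \
    --save_total_limit 3 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.04 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --report_to wandb \
    --run_name "Experiment" \
    --gradient_checkpointing True \
    --lazy_preprocess True
```

Every node has to be able to reach the rendezvous host on that port before the join timeout expires; if one node never connects, a `RendezvousTimeoutError` like the one above is the expected failure, so checking connectivity and firewall rules between the two machines is a reasonable first step.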

ashmalvayani · Apr 05 '24, 21:04