
DDP expects same model across all ranks, but Rank 0 has 128 params, while rank 1 has inconsistent 0 params.


Hi, I ran into a problem where DDP reports that the ranks have different models. Details are below.


./train_lora.sh
WARNING:torch.distributed.run:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

bin /root/anaconda3/envs/chat-doctor/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda118.so
CUDA SETUP: CUDA runtime path found: /root/anaconda3/envs/chat-doctor/lib/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /root/anaconda3/envs/chat-doctor/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...
bin /root/anaconda3/envs/chat-doctor/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda118.so




Finetuning model with params:
base_model: /disk2/data/xk/retr-llm/files/model/llama-7b/
data_path: /disk2/data/xk/retr-llm/files/datasets/mental_health_chatbot_dataset.json
output_dir: ./lora-chatDoctor_bs192_Mbs24_ep3_len512_lr3e-5_fromAlpacaLora
batch_size: 192
micro_batch_size: 24
num_epochs: 3
learning_rate: 3e-05
cutoff_len: 256
val_set_size: 120
use_gradient_checkpointing: False
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules: None
bottleneck_size: 256
non_linearity: tanh
adapter_dropout: 0.0
use_parallel_adapter: False
use_adapterp: False
train_on_inputs: True
scaling: 1.0
adapter_name: lora
target_modules: None
group_by_length: False
wandb_project:
wandb_run_name:
wandb_watch:
wandb_log_model:
resume_from_checkpoint: None

Loading checkpoint shards: 100%|##########| 33/33 [00:12<00:00, 2.58it/s]
trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199
Map: 100%|##########| 52/52 [00:00<00:00, 687.22 examples/s]
Map: 100%|##########| 120/120 [00:00<00:00, 765.56 examples/s]

[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1807082 milliseconds before timing out.
Traceback (most recent call last):
  File "train_lora.py", line 353, in <module>
    fire.Fire(train)
  File "/root/anaconda3/envs/chat-doctor/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/root/anaconda3/envs/chat-doctor/lib/python3.8/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/root/anaconda3/envs/chat-doctor/lib/python3.8/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "train_lora.py", line 299, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/root/anaconda3/envs/chat-doctor/lib/python3.8/site-packages/transformers/trainer.py", line 1662, in train
    return inner_training_loop(
  File "/root/anaconda3/envs/chat-doctor/lib/python3.8/site-packages/transformers/trainer.py", line 1749, in _inner_training_loop
    model = self._wrap_model(self.model_wrapped)
  File "/root/anaconda3/envs/chat-doctor/lib/python3.8/site-packages/transformers/trainer.py", line 1569, in _wrap_model
    model = nn.parallel.DistributedDataParallel(
  File "/root/anaconda3/envs/chat-doctor/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 674, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/root/anaconda3/envs/chat-doctor/lib/python3.8/site-packages/torch/distributed/utils.py", line 118, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: DDP expects same model across all ranks, but Rank 0 has 128 params, while rank 1 has inconsistent 0 params.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1807082 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1807414 milliseconds before timing out.
Traceback (most recent call last):
  File "train_lora.py", line 353, in <module>
    fire.Fire(train)
  File "/root/anaconda3/envs/chat-doctor/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/root/anaconda3/envs/chat-doctor/lib/python3.8/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/root/anaconda3/envs/chat-doctor/lib/python3.8/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "train_lora.py", line 299, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/root/anaconda3/envs/chat-doctor/lib/python3.8/site-packages/transformers/trainer.py", line 1662, in train
    return inner_training_loop(
  File "/root/anaconda3/envs/chat-doctor/lib/python3.8/site-packages/transformers/trainer.py", line 1749, in _inner_training_loop
    model = self._wrap_model(self.model_wrapped)
  File "/root/anaconda3/envs/chat-doctor/lib/python3.8/site-packages/transformers/trainer.py", line 1569, in _wrap_model
    model = nn.parallel.DistributedDataParallel(
  File "/root/anaconda3/envs/chat-doctor/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 674, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/root/anaconda3/envs/chat-doctor/lib/python3.8/site-packages/torch/distributed/utils.py", line 118, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: DDP expects same model across all ranks, but Rank 3 has 128 params, while rank 0 has inconsistent 0 params.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1807414 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1807716 milliseconds before timing out.
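Since the first collective that times out is the ALLGATHER issued by DDP's parameter verification, it may help to separate a broken NCCL setup from a genuine model mismatch. Below is a bare-bones sanity check along those lines (only a sketch; the file name check_nccl.py is hypothetical and not part of this repo), launched with the same torchrun settings as training:

```python
# check_nccl.py (hypothetical name) -- minimal NCCL sanity check, launched like:
#   torchrun --nproc_per_node=8 check_nccl.py
import os

import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

# A single tiny all_gather: if this already hangs until the watchdog timeout,
# the problem is in the NCCL / multi-GPU setup rather than in the LoRA model.
x = torch.tensor([dist.get_rank()], device="cuda")
out = [torch.zeros_like(x) for _ in range(dist.get_world_size())]
dist.all_gather(out, x)
if dist.get_rank() == 0:
    print("all_gather ok:", [int(t) for t in out])

dist.destroy_process_group()
```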


My environment: 8 x A100 80GB GPUs, PyTorch 2.0.1.

How can I solve this bug? Thanks!
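In case it helps with diagnosis, here is a sketch of a per-rank check of the trainable-parameter count right before the point where DDP would wrap the model. The base model path and LoRA hyperparameters are copied from the log above; target_modules=["q_proj", "v_proj"] is an assumption (the script prints None), and the sketch loads the model in fp16 rather than 8-bit, which should not change the number of trainable LoRA tensors:

```python
# count_lora_params.py (hypothetical name) -- print each rank's trainable-parameter
# tensor count before DDP wrapping; run with: torchrun --nproc_per_node=8 count_lora_params.py
import os

import torch
import torch.distributed as dist
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

dist.init_process_group(backend="nccl")
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

# Values copied from the training log; target_modules is an assumption.
model = AutoModelForCausalLM.from_pretrained(
    "/disk2/data/xk/retr-llm/files/model/llama-7b/", torch_dtype=torch.float16
)
model = get_peft_model(
    model,
    LoraConfig(
        r=8,
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],
        task_type="CAUSAL_LM",
    ),
)

# The "128 params" in the error matches the number of trainable LoRA tensors,
# so count the parameters that require gradients on this rank.
n_trainable = sum(1 for p in model.parameters() if p.requires_grad)
counts = [torch.zeros(1, dtype=torch.long, device="cuda") for _ in range(dist.get_world_size())]
dist.all_gather(counts, torch.tensor([n_trainable], dtype=torch.long, device="cuda"))
if dist.get_rank() == 0:
    # A healthy run should print the same count on every rank; with the error
    # above, something like [128, 0, ...] would be expected here instead.
    print("trainable parameter tensors per rank:", [int(c) for c in counts])

dist.destroy_process_group()
```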

xukefaker · Sep 10 '23 05:09