RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.
BUG
I got the error below while trying to train BELLE-EXT-13B; I had changed the distributed backend from nccl to gloo. (A minimal standalone illustration of the underlying RuntimeError follows the traceback.)
Traceback (most recent call last):
File "/gpfs/home/user/code_SFT/../code_BELLE/train/src/train.py", line 398, in <module>
main()
File "/gpfs/home/user/code_SFT/../code_BELLE/train/src/train.py", line 390, in main
trainer.train(resume_from_checkpoint=None)
File "/home/user/.conda/envs/belle/lib/python3.11/site-packages/transformers/trainer.py", line 1664, in train
return inner_training_loop(
File "/home/user/.conda/envs/belle/lib/python3.11/site-packages/transformers/trainer.py", line 1741, in _inner_training_loop
deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
File "/home/user/.conda/envs/belle/lib/python3.11/site-packages/transformers/deepspeed.py", line 378, in deepspeed_init
deepspeed_engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
File "/home/user/.conda/envs/belle/lib/python3.11/site-packages/deepspeed/__init__.py", line 165, in initialize
engine = DeepSpeedEngine(args=args,
File "/home/user/.conda/envs/belle/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 266, in __init__
self._configure_distributed_model(model)
File "/home/user/.conda/envs/belle/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1073, in _configure_distributed_model
self._broadcast_model()
File "/home/user/.conda/envs/belle/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1003, in _broadcast_model
dist.broadcast(p, groups._get_broadcast_src_rank(), group=self.data_parallel_group)
File "/home/user/.conda/envs/belle/lib/python3.11/site-packages/deepspeed/comm/comm.py", line 120, in log_wrapper
return func(*args, **kwargs)
File "/home/user/.conda/envs/belle/lib/python3.11/site-packages/deepspeed/comm/comm.py", line 217, in broadcast
return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
File "/home/user/.conda/envs/belle/lib/python3.11/site-packages/deepspeed/comm/torch.py", line 118, in broadcast
return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
File "/home/user/.conda/envs/belle/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1451, in wrapper
return func(*args, **kwargs)
File "/home/user/.conda/envs/belle/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1574, in broadcast
work.wait()
RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2357587) of binary: /home/user/.conda/envs/belle/bin/python
Traceback (most recent call last):
File "/home/user/.conda/envs/belle/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==2.0.1', 'console_scripts', 'torchrun')())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/.conda/envs/belle/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/user/.conda/envs/belle/lib/python3.11/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/home/user/.conda/envs/belle/lib/python3.11/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/user/.conda/envs/belle/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/.conda/envs/belle/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
../code_BELLE/train/src/train.py FAILED
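For reference, the RuntimeError itself is PyTorch's generic guard against in-place writes to leaf tensors that require grad; it can be triggered entirely outside DeepSpeed (minimal illustration only, not code from BELLE):

import torch

t = torch.zeros(4, requires_grad=True)  # a leaf tensor that requires grad
t.add_(1.0)  # RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.

In the traceback above, the in-place write appears to be torch.distributed.broadcast filling a model parameter inside DeepSpeed's _broadcast_model() while the gloo backend is active.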
Environment
GPU: 2*A100 80GB
python==3.11.3
pytorch==2.0.1 (py3.11_cuda11.7_cudnn8.5.0_0)
transformers==4.29.2
deepspeed==0.9.2
cuda.__version__==11.7
Cmd
I added two lines to train.py:
import deepspeed
deepspeed.init_distributed("gloo")  # placed as the first statement of main()
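In context, the change looks like this (a sketch only; everything else in main() is the repo's original code, which I have not reproduced here):

import deepspeed  # new import at the top of train.py

def main():
    deepspeed.init_distributed("gloo")  # new first statement of main()
    ...  # the rest of main() (argument parsing, model loading, Trainer setup) is unchanged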
I also added --ddp_backend gloo to the bash launch command:
torchrun --nproc_per_node 2 ./train/src/train.py \
--model_name_or_path ${model_name_or_path} \
--llama \
--deepspeed ./train/configs/deepspeed_config.json \
--train_file ${train_file} \
--validation_file ${validation_file} \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 2 \
--gradient_accumulation_steps 4 \
--num_train_epochs 2 \
--model_max_length ${cutoff_len} \
--save_strategy "steps" \
--save_total_limit 3 \
--learning_rate 8e-6 \
--weight_decay 0.00001 \
--warmup_ratio 0.05 \
--lr_scheduler_type "cosine" \
--logging_steps 10 \
--evaluation_strategy "steps" \
--fp16 True \
--seed 1234 \
--gradient_checkpointing True \
--cache_dir ${cache_dir} \
--output_dir ${output_dir} \
--ddp_backend gloo