An error when I use DeepSpeed with accelerate
System Info
My accelerate config:
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 8
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 3
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
#---------------------------------------------------------------------------------------
The error is as follows:
Traceback (most recent call last):
File "/users_2/d00477216/3_train_gpus/main.py", line 38, in <module>
model, optimizer, train_dataloader, scheduler = accelerator.prepare(model, optimizer, train_dataloader, scheduler)
File "/user_lc/envs/lc_nlp39/lib/python3.9/site-packages/accelerate/accelerator.py", line 1118, in prepare
result = self._prepare_deepspeed(*args)
File "/user_lc/envs/lc_nlp39/lib/python3.9/site-packages/accelerate/accelerator.py", line 1415, in _prepare_deepspeed
engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
File "/user_lc/envs/lc_nlp39/lib/python3.9/site-packages/deepspeed/__init__.py", line 125, in initialize
engine = DeepSpeedEngine(args=args,
File "/user_lc/envs/lc_nlp39/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 340, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/user_lc/envs/lc_nlp39/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1298, in _configure_optimizer
self.optimizer = self._configure_zero_optimizer(basic_optimizer)
File "/user_lc/envs/lc_nlp39/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1547, in _configure_zero_optimizer
optimizer = DeepSpeedZeroOptimizer(
File "/user_lc/envs/lc_nlp39/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 324, in __init__
self.flatten_dense_tensors_aligned(
File "/user_lc/envs/lc_nlp39/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 867, in flatten_dense_tensors_aligned
return self.flatten(align_dense_tensors(tensor_list, alignment))
RuntimeError: torch.cat(): expected a non-empty list of Tensors
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 27320) of binary: /user_lc/envs/lc_nlp39/bin/python
Traceback (most recent call last):
File "/home/anaconda3/envs/lc_nlp39/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/user_lc/envs/lc_nlp39/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/user_lc/envs/lc_nlp39/lib/python3.9/site-packages/accelerate/commands/launch.py", line 908, in launch_command
deepspeed_launcher(args)
File "/user_lc/envs/lc_nlp39/lib/python3.9/site-packages/accelerate/commands/launch.py", line 647, in deepspeed_launcher
distrib_run.run(args)
File "/user_lc/envs/lc_nlp39/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/user_lc/envs/lc_nlp39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/user_lc/envs/lc_nlp39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
main.py FAILED
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
- [X] My own task or dataset (give details below)
Reproduction
# imports implied by the snippet (not shown in the original excerpt)
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin
from torch.utils.data import DataLoader
import transformers

deepspeed_plugin = DeepSpeedPlugin(gradient_accumulation_steps=args.gradient_accumulation_steps)
accelerator = Accelerator(mixed_precision=args.mixed_precision, deepspeed_plugin=deepspeed_plugin)
device = accelerator.device
transformers.set_seed(args.seed)
# load the model
model_loader = args.MODEL_LOADER[args.base_model](args)
tokenizer, model = model_loader.load_model()
model.to(device)
# train
if args.do_train:
    # data handling
    train_dataset = args.PROCESSOR_CLASSES[args.task_name](args, 'train', tokenizer)
    train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=args.train_batch_size,
                                  collate_fn=train_dataset.collate_fn)
    dev_dataset = args.PROCESSOR_CLASSES[args.task_name](args, 'dev', tokenizer)
    dev_dataloader = DataLoader(dev_dataset, batch_size=args.dev_batch_size, collate_fn=dev_dataset.collate_fn)
    optimizer, scheduler = set_optimizer(args, model, train_dataset)
    accelerator.wait_for_everyone()
    print('-----------------------------------', len(train_dataloader))
    model, optimizer, train_dataloader, scheduler = accelerator.prepare(model, optimizer, train_dataloader, scheduler)
    # train
    train_gpus(args, model, train_dataloader, dev_dataloader, optimizer, scheduler, tokenizer, accelerator, device)
Expected behavior
I would like to know how to fix this.
Hello, can you print the number of trainable parameters? Looking at the error, it seems there are no trainable params, hence the "expected a non-empty list of Tensors" error when building the optimizer.
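(For reference, a minimal way to print that count for a torch.nn.Module called `model`; this snippet is illustrative and not from the thread:)

# Count trainable vs. total parameters of the model.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable params: {trainable} || all params: {total} || trainable%: {100 * trainable / total}")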
@pacman100 thank you! I use the LoRA method to train my ChatGLM model; below are the params:
trainable params: 1572864 || all params: 1723981824 || trainable%: 0.09123437254985815 (the same line appears six times in the output)
LoRA config:
LORA_PARA = {
    'r': 8,
    'lora_alpha': 32,
    'lora_dropout': 0.05,
    'target_modules': ['query_key_value'],
    'task_type': "CAUSAL_LM",
    'inference_mode': False,
}
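(The issue does not show how LORA_PARA is applied to the model; a sketch assuming the PEFT library is used would look roughly like this:)

# Assumption: LORA_PARA is unpacked into PEFT's LoraConfig; this step is not shown in the issue.
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(**LORA_PARA)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # produces the "trainable params || all params" line above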
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@xdnjust were you able to resolve the issue? I am also facing the same issue when using LoRA with any model. Turning off LoRA makes it work correctly.
@xdnjust I also encountered the same issue. Did you figure out how to resolve it? Or is there any update about this issue?
Same issue.
Hello, can you please provide the entire script with details on how the model is created before passing it to accelerator.prepare, as well as the versions of torch, transformers, accelerate, peft and deepspeed, and the command used to launch the script.
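(A quick way to collect those versions, as a sketch; the package names are the ones requested above:)

# Print the installed version of each requested package; missing packages are reported as such.
import importlib.metadata as metadata

for pkg in ("torch", "transformers", "accelerate", "peft", "deepspeed"):
    try:
        print(pkg, metadata.version(pkg))
    except metadata.PackageNotFoundError:
        print(pkg, "not installed")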
Same issue.
Same issue, does anyone have an idea how to fix it?
@EnterpriseBCMe please give us a full reproducer for your problem that we can run.
Please check that all the parameters passed to the optimizer require gradients. I met the same issue, and filtering the parameters by requires_grad before putting them into the optimizer seems to work.
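(As a sketch of that workaround; the optimizer class and learning rate below are placeholders, not taken from the issue:)

import torch

# Build the optimizer only from parameters that require gradients,
# so DeepSpeed's ZeRO optimizer does not receive an empty parameter list.
trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable_params, lr=1e-4)  # placeholder optimizer and lr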