
An error when I use DeepSpeed with accelerate

Open xdnjust opened this issue 2 years ago • 3 comments

System Info

my accelerate config:
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 8
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 3
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

#---------------------------------------------------------------------------------------
The error is as follows:
Traceback (most recent call last):
  File "/users_2/d00477216/3_train_gpus/main.py", line 38, in <module>
    model, optimizer, train_dataloader, scheduler = accelerator.prepare(model, optimizer, train_dataloader, scheduler)
  File "/user_lc/envs/lc_nlp39/lib/python3.9/site-packages/accelerate/accelerator.py", line 1118, in prepare
    result = self._prepare_deepspeed(*args)
  File "/user_lc/envs/lc_nlp39/lib/python3.9/site-packages/accelerate/accelerator.py", line 1415, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/user_lc/envs/lc_nlp39/lib/python3.9/site-packages/deepspeed/__init__.py", line 125, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/user_lc/envs/lc_nlp39/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 340, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/user_lc/envs/lc_nlp39/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1298, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/user_lc/envs/lc_nlp39/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1547, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer(
  File "/user_lc/envs/lc_nlp39/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 324, in __init__
    self.flatten_dense_tensors_aligned(
  File "/user_lc/envs/lc_nlp39/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 867, in flatten_dense_tensors_aligned
    return self.flatten(align_dense_tensors(tensor_list, alignment))
RuntimeError: torch.cat(): expected a non-empty list of Tensors
Traceback (most recent call last):
  File "/users_2/d00477216/3_train_gpus/main.py", line 38, in <module>
    model, optimizer, train_dataloader, scheduler = accelerator.prepare(model, optimizer, train_dataloader, scheduler)
  File "/user_lc/envs/lc_nlp39/lib/python3.9/site-packages/accelerate/accelerator.py", line 1118, in prepare
    result = self._prepare_deepspeed(*args)
  File "/user_lc/envs/lc_nlp39/lib/python3.9/site-packages/accelerate/accelerator.py", line 1415, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/user_lc/envs/lc_nlp39/lib/python3.9/site-packages/deepspeed/__init__.py", line 125, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/user_lc/envs/lc_nlp39/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 340, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/user_lc/envs/lc_nlp39/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1298, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/user_lc/envs/lc_nlp39/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1547, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer(
  File "/user_lc/envs/lc_nlp39/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 324, in __init__
    self.flatten_dense_tensors_aligned(
  File "/user_lc/envs/lc_nlp39/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 867, in flatten_dense_tensors_aligned
    return self.flatten(align_dense_tensors(tensor_list, alignment))
RuntimeError: torch.cat(): expected a non-empty list of Tensors
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 27320) of binary: /user_lc/envs/lc_nlp39/bin/python
Traceback (most recent call last):
  File "/home/anaconda3/envs/lc_nlp39/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/user_lc/envs/lc_nlp39/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/user_lc/envs/lc_nlp39/lib/python3.9/site-packages/accelerate/commands/launch.py", line 908, in launch_command
    deepspeed_launcher(args)
  File "/user_lc/envs/lc_nlp39/lib/python3.9/site-packages/accelerate/commands/launch.py", line 647, in deepspeed_launcher
    distrib_run.run(args)
  File "/user_lc/envs/lc_nlp39/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/user_lc/envs/lc_nlp39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/user_lc/envs/lc_nlp39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
main.py FAILED

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • [X] My own task or dataset (give details below)

Reproduction

deepspeed_plugin = DeepSpeedPlugin(gradient_accumulation_steps=args.gradient_accumulation_steps)
accelerator = Accelerator(mixed_precision=args.mixed_precision, deepspeed_plugin=deepspeed_plugin)
device = accelerator.device

transformers.set_seed(args.seed)
# load the model
model_loader = args.MODEL_LOADER[args.base_model](args)
tokenizer, model = model_loader.load_model()

model.to(device)
# train
if args.do_train:
    # data handle
    train_dataset = args.PROCESSOR_CLASSES[args.task_name](args, 'train', tokenizer)
    train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=args.train_batch_size,
                                  collate_fn=train_dataset.collate_fn)
    dev_dataset = args.PROCESSOR_CLASSES[args.task_name](args, 'dev', tokenizer)
    dev_dataloader = DataLoader(dev_dataset, batch_size=args.dev_batch_size, collate_fn=dev_dataset.collate_fn)

    optimizer, scheduler = set_optimizer(args, model, train_dataset)
    accelerator.wait_for_everyone()
    print('-----------------------------------', len(train_dataloader))
    model, optimizer, train_dataloader, scheduler = accelerator.prepare(model, optimizer, train_dataloader, scheduler)

    # train
    train_gpus(args, model, train_dataloader, dev_dataloader, optimizer, scheduler, tokenizer, accelerator, device)

Expected behavior

I wonder how to fix it

xdnjust avatar Jun 05 '23 10:06 xdnjust

Hello, can you print the number of trainable parameters? Looking at the error, it seems like there are no trainable params, hence the "expected a non-empty list of Tensors" error when the optimizer is built.
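For reference, something like this printed just before accelerator.prepare would show it (a generic sketch, assuming model is the torch module being trained):

# count parameters that will actually receive gradients
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable params: {trainable} || all params: {total} || trainable%: {100 * trainable / total}")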

pacman100 avatar Jun 05 '23 10:06 pacman100

@pacman100 thank you! I use the LoRA method to train my ChatGLM model; below are the params:

trainable params: 1572864 || all params: 1723981824 || trainable%: 0.09123437254985815
trainable params: 1572864 || all params: 1723981824 || trainable%: 0.09123437254985815
trainable params: 1572864 || all params: 1723981824 || trainable%: 0.09123437254985815
trainable params: 1572864 || all params: 1723981824 || trainable%: 0.09123437254985815
trainable params: 1572864 || all params: 1723981824 || trainable%: 0.09123437254985815
trainable params: 1572864 || all params: 1723981824 || trainable%: 0.09123437254985815

xdnjust avatar Jun 05 '23 11:06 xdnjust

LoRA config:

LORA_PARA = {
    'r': 8,
    'lora_alpha': 32,
    'lora_dropout': 0.05,
    'target_modules': ['query_key_value'],
    'task_type': "CAUSAL_LM",
    'inference_mode': False,
}
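For context, a sketch of how a config like this is typically applied with peft (the actual model-loading code is not shown in this thread, so this is illustrative):

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=['query_key_value'],
    task_type="CAUSAL_LM",
    inference_mode=False,
)
# wrap the base model so that only the LoRA adapter weights require gradients
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()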

xdnjust avatar Jun 05 '23 11:06 xdnjust

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Jul 05 '23 15:07 github-actions[bot]

@xdnjust were you able to resolve the issue? I am also facing the same problem when using LoRA with any model. Turning off LoRA makes it work correctly.

darth-c0d3r avatar Jul 11 '23 17:07 darth-c0d3r

@xdnjust I also encountered the same issue. Did you figure out how to resolve it? Or is there any update about this issue?

hajipour avatar Aug 03 '23 08:08 hajipour

Same issue. 😂

ArtificialCat avatar Aug 25 '23 09:08 ArtificialCat

Hello, can you please provide the entire script, with details on how the model is created before it is passed to accelerator.prepare, as well as the versions of torch, transformers, accelerate, peft and deepspeed, and the command used to launch the script.

pacman100 avatar Sep 01 '23 06:09 pacman100

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Sep 25 '23 15:09 github-actions[bot]

Same issue.

stephen-nju avatar Sep 26 '23 06:09 stephen-nju

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Oct 20 '23 15:10 github-actions[bot]

Same issue, does anyone have an idea how to fix it? 😂

EnterpriseBCMe avatar Apr 24 '24 08:04 EnterpriseBCMe

@EnterpriseBCMe please give us a full reproducer for your problem that we can run.

muellerzr avatar Apr 29 '24 17:04 muellerzr

Please check that every parameter passed to the optimizer requires a gradient. I ran into the same issue, and filtering the parameters by requires_grad before handing them to the optimizer seems to work.
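Roughly along these lines (a sketch; in the original script the equivalent change would go inside set_optimizer, and the optimizer class and learning rate here are just placeholders):

import torch

# hand the optimizer only the parameters that actually require gradients;
# with LoRA the frozen base-model weights must be excluded
trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable_params, lr=1e-4)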

SuperCarryDFY avatar Jul 04 '24 08:07 SuperCarryDFY