
An error when I use DeepSpeed with accelerate

Open xdnjust opened this issue 2 years ago • 3 comments

System Info

my accelerate config:
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 8
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 3
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

#---------------------------------------------------------------------------------------
The error is as follows:
Traceback (most recent call last):
  File "/users_2/d00477216/3_train_gpus/main.py", line 38, in <module>
    model, optimizer, train_dataloader, scheduler = accelerator.prepare(model, optimizer, train_dataloader, scheduler)
  File "/user_lc/envs/lc_nlp39/lib/python3.9/site-packages/accelerate/accelerator.py", line 1118, in prepare
    result = self._prepare_deepspeed(*args)
  File "/user_lc/envs/lc_nlp39/lib/python3.9/site-packages/accelerate/accelerator.py", line 1415, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/user_lc/envs/lc_nlp39/lib/python3.9/site-packages/deepspeed/__init__.py", line 125, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/user_lc/envs/lc_nlp39/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 340, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/user_lc/envs/lc_nlp39/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1298, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/user_lc/envs/lc_nlp39/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1547, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer(
  File "/user_lc/envs/lc_nlp39/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 324, in __init__
    self.flatten_dense_tensors_aligned(
  File "/user_lc/envs/lc_nlp39/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 867, in flatten_dense_tensors_aligned
    return self.flatten(align_dense_tensors(tensor_list, alignment))
RuntimeError: torch.cat(): expected a non-empty list of Tensors
Traceback (most recent call last):
  File "/users_2/d00477216/3_train_gpus/main.py", line 38, in <module>
    model, optimizer, train_dataloader, scheduler = accelerator.prepare(model, optimizer, train_dataloader, scheduler)
  File "/user_lc/envs/lc_nlp39/lib/python3.9/site-packages/accelerate/accelerator.py", line 1118, in prepare
    result = self._prepare_deepspeed(*args)
  File "/user_lc/envs/lc_nlp39/lib/python3.9/site-packages/accelerate/accelerator.py", line 1415, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/user_lc/envs/lc_nlp39/lib/python3.9/site-packages/deepspeed/__init__.py", line 125, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/user_lc/envs/lc_nlp39/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 340, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/user_lc/envs/lc_nlp39/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1298, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/user_lc/envs/lc_nlp39/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1547, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer(
  File "/user_lc/envs/lc_nlp39/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 324, in __init__
    self.flatten_dense_tensors_aligned(
  File "/user_lc/envs/lc_nlp39/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 867, in flatten_dense_tensors_aligned
    return self.flatten(align_dense_tensors(tensor_list, alignment))
RuntimeError: torch.cat(): expected a non-empty list of Tensors
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 27320) of binary: /user_lc/envs/lc_nlp39/bin/python
Traceback (most recent call last):
  File "/home/anaconda3/envs/lc_nlp39/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/user_lc/envs/lc_nlp39/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/user_lc/envs/lc_nlp39/lib/python3.9/site-packages/accelerate/commands/launch.py", line 908, in launch_command
    deepspeed_launcher(args)
  File "/user_lc/envs/lc_nlp39/lib/python3.9/site-packages/accelerate/commands/launch.py", line 647, in deepspeed_launcher
    distrib_run.run(args)
  File "/user_lc/envs/lc_nlp39/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/user_lc/envs/lc_nlp39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/user_lc/envs/lc_nlp39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
main.py FAILED

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • [X] My own task or dataset (give details below)

Reproduction

deepspeed_plugin = DeepSpeedPlugin(gradient_accumulation_steps=args.gradient_accumulation_steps)
accelerator = Accelerator(mixed_precision=args.mixed_precision, deepspeed_plugin=deepspeed_plugin)
device = accelerator.device

transformers.set_seed(args.seed)
# load the model
model_loader = args.MODEL_LOADER[args.base_model](args)
tokenizer, model = model_loader.load_model()

model.to(device)
# train
if args.do_train:
    # data handle
    train_dataset = args.PROCESSOR_CLASSES[args.task_name](args, 'train', tokenizer)
    train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=args.train_batch_size,
                                  collate_fn=train_dataset.collate_fn)
    dev_dataset = args.PROCESSOR_CLASSES[args.task_name](args, 'dev', tokenizer)
    dev_dataloader = DataLoader(dev_dataset, batch_size=args.dev_batch_size, collate_fn=dev_dataset.collate_fn)

    optimizer, scheduler = set_optimizer(args, model, train_dataset)
    accelerator.wait_for_everyone()
    print('-----------------------------------', len(train_dataloader))
    model, optimizer, train_dataloader, scheduler = accelerator.prepare(model, optimizer, train_dataloader, scheduler)

    # train
    train_gpus(args, model, train_dataloader, dev_dataloader, optimizer, scheduler, tokenizer, accelerator, device)

Expected behavior

I wonder how to fix it

xdnjust avatar Jun 05 '23 10:06 xdnjust

Hello, can you print the number of trainable parameters? Looking at the error, it seems like there are no trainable params, hence the "expected a non-empty list of Tensors" error when the optimizer is built.
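For reference, something like this printed just before accelerator.prepare would show it (a generic sketch, assuming model is the torch module being trained):

# count parameters that will actually receive gradients
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable params: {trainable} || all params: {total} || trainable%: {100 * trainable / total}")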

pacman100 avatar Jun 05 '23 10:06 pacman100

@pacman100 thank you! I use the LoRA method to train my ChatGLM model; below are the params:

trainable params: 1572864 || all params: 1723981824 || trainable%: 0.09123437254985815
trainable params: 1572864 || all params: 1723981824 || trainable%: 0.09123437254985815
trainable params: 1572864 || all params: 1723981824 || trainable%: 0.09123437254985815
trainable params: 1572864 || all params: 1723981824 || trainable%: 0.09123437254985815
trainable params: 1572864 || all params: 1723981824 || trainable%: 0.09123437254985815
trainable params: 1572864 || all params: 1723981824 || trainable%: 0.09123437254985815

xdnjust avatar Jun 05 '23 11:06 xdnjust

LoRA config:

LORA_PARA = {
    'r': 8,
    'lora_alpha': 32,
    'lora_dropout': 0.05,
    'target_modules': ['query_key_value'],
    'task_type': "CAUSAL_LM",
    'inference_mode': False,
}
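For context, a sketch of how a config like this is typically applied with peft (the actual model-loading code is not shown in this thread, so this is illustrative):

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=['query_key_value'],
    task_type="CAUSAL_LM",
    inference_mode=False,
)
# wrap the base model so that only the LoRA adapter weights require gradients
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()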

xdnjust avatar Jun 05 '23 11:06 xdnjust

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Jul 05 '23 15:07 github-actions[bot]

@xdnjust were you able to resolve the issue? I am also facing the same problem when using LoRA with any model. Turning off LoRA makes it work correctly.

darth-c0d3r avatar Jul 11 '23 17:07 darth-c0d3r

@xdnjust I also encountered the same issue. Did you figure out how to resolve it? Or is there any update about this issue?

hajipour avatar Aug 03 '23 08:08 hajipour

Same issue. 😂

ArtificialCat avatar Aug 25 '23 09:08 ArtificialCat

Hello, can you please provide the entire script, with details on how the model is created before it is passed to accelerator.prepare, as well as the versions of torch, transformers, accelerate, peft and deepspeed, and the command used to launch the script.

pacman100 avatar Sep 01 '23 06:09 pacman100

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Sep 25 '23 15:09 github-actions[bot]

Same issue.

stephen-nju avatar Sep 26 '23 06:09 stephen-nju

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Oct 20 '23 15:10 github-actions[bot]

Same issue, does anyone have an idea how to fix it? 😂

EnterpriseBCMe avatar Apr 24 '24 08:04 EnterpriseBCMe

@EnterpriseBCMe please give us a full reproducer for your problem that we can run.

muellerzr avatar Apr 29 '24 17:04 muellerzr

Please check that every parameter passed to the optimizer requires a gradient. I ran into the same issue, and filtering the parameters by requires_grad before handing them to the optimizer seems to work.
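Roughly along these lines (a sketch; in the original script the equivalent change would go inside set_optimizer, and the optimizer class and learning rate here are just placeholders):

import torch

# hand the optimizer only the parameters that actually require gradients;
# with LoRA the frozen base-model weights must be excluded
trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable_params, lr=1e-4)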

SuperCarryDFY avatar Jul 04 '24 08:07 SuperCarryDFY