
_prepare_deepspeed fails to capture correct kwargs with DummyOptim or DummyScheduler when calling prepare() multiple times

Open Jason3900 opened this issue 4 months ago • 3 comments

System Info

accelerate==0.34.2
python==3.10
deepspeed==0.15.1

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • [X] My own task or dataset (give details below)

Reproduction

Hey, since I may want to prepare only certain items depending on my training arguments (suppose I don't want to prepare the scheduler this time), I decided to put them in a dict and call the prepare function multiple times, since the number of items is not fixed. After that, I use setattr to assign each prepared object back to its attribute. This worked perfectly until I changed my code to support the DeepSpeed plugin.

        # handle scheduler manually
        accelerator_to_prepare = OrderedDict(
            [
                ("optimizer", self.optimizer),   
                ("train_dataloader", self.train_dataloader),
                ("valid_dataloader", self.valid_dataloader),
                ("lr_scheduler", self.lr_scheduler),
                ("model", self.model),
            ]
        )
        if self.use_gan:
            accelerator_to_prepare["discriminator"] = self.discriminator

        for k, v in accelerator_to_prepare.items():
            self.print_global_rank_0(f"start prepare {k}")
            setattr(self, k, self.accelerator.prepare(v))

In the accelerator's _prepare_deepspeed function, it inspects the objects passed to a single prepare() call, finds the corresponding optimizer and scheduler, and extracts the kwargs they were constructed with so it can fill in the DeepSpeed config and make everything work. But in my case, since I call prepare() multiple times, each call only sees the objects from that call, so the result of the last call contains only one item ([model] in my case). It therefore cannot find the kwargs needed by the optimizer and scheduler (because they are set to "auto" in the DeepSpeed config), and deepspeed_config_process fails with an error.

        model = None
        optimizer = None
        scheduler = None
        for obj in result:
            if isinstance(obj, torch.nn.Module):
                model = obj
            elif isinstance(obj, (torch.optim.Optimizer, DummyOptim)):
                optimizer = obj
            elif (isinstance(obj, (LRScheduler, DummyScheduler))) or (
                type(obj).__name__ in deepspeed.runtime.lr_schedules.VALID_LR_SCHEDULES
            ):
                scheduler = obj
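The only workaround I have found so far is to bundle everything into a single prepare() call so that this loop sees the model, DummyOptim, and DummyScheduler together. A rough sketch of how I would restructure my loop above (just an assumption about the ordering, not tested with the GAN branch):

        # workaround sketch: a single prepare() call so _prepare_deepspeed
        # receives the model, optimizer and scheduler at the same time
        prepared = self.accelerator.prepare(*accelerator_to_prepare.values())
        for name, obj in zip(accelerator_to_prepare.keys(), prepared):
            setattr(self, name, obj)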

Expected behavior

I think accelerate should handle this scenario, i.e. calling prepare() multiple times when the DeepSpeed config relies on "auto" values for the optimizer and scheduler.

Jason3900 avatar Sep 30 '24 03:09 Jason3900