[BUG]"DeepSpeedZeRoOffload missing '_restore_from_bit16_weights' method when loading checkpoints"

Open calliope-pro opened this issue 7 months ago • 3 comments

Describe the bug
When trying to load a checkpoint with DeepSpeedZeRoOffload during the testing phase, I'm encountering an AttributeError because checkpoint loading tries to call a method named _restore_from_bit16_weights() on the optimizer, and that method doesn't exist in the DeepSpeedZeRoOffload class.

To Reproduce
Steps to reproduce the behavior:

  1. Train a model using DeepSpeed ZeRO-3 with parameter offloading
  2. Save a checkpoint using PyTorch Lightning
  3. Try to load the checkpoint for testing with trainer.test()

Required packages:

  • deepspeed==0.16.3
  • pytorch-lightning==2.4.0
  • torch==2.5.1

The error occurs specifically when trying to restore the optimizer state during checkpoint loading.

Expected behavior
The checkpoint should load successfully for testing, properly restoring both model weights and optimizer state.

System info (please complete the following information):

  • OS: Ubuntu 22.04
  • GPU count and types: 1x A6000
  • Python version: 3.11.6

calliope-pro avatar May 06 '25 13:05 calliope-pro

I can confirm and reproduce the bug.

championsnet avatar May 08 '25 12:05 championsnet

@calliope-pro, @championsnet can you provide a stack trace and repro scripts?

sfc-gh-truwase avatar May 13 '25 11:05 sfc-gh-truwase

@sfc-gh-truwase

# Imports used by the snippet below (lightning 2.x namespace)
from logging import DEBUG, FileHandler, Formatter, getLogger
from pathlib import Path

import yaml
from lightning.pytorch import Trainer
from lightning.pytorch.callbacks import EarlyStopping, ModelCheckpoint
from lightning.pytorch.strategies import DeepSpeedStrategy

deepspeed_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "stage3_param_persistence_threshold": 1e3,  # Set as small as possible
        "stage3_max_live_parameters": 5e6,  # Significantly smaller than default
    },
    "zero_allow_untested_optimizer": True,
    "zero_force_ds_cpu_optimizer": False
}
trainer = Trainer(
    strategy=DeepSpeedStrategy(
        config=deepspeed_config,
    ),
    precision="bf16-mixed",
    min_epochs=config["min_epochs"],
    max_epochs=config["max_epochs"],
    callbacks=[
        EarlyStopping(monitor="validation_loss", patience=4, mode="min"),
        ModelCheckpoint(
            filename="best-checkpoint-{epoch:02d}",
            monitor="validation_loss",
            mode="min",
            save_last=True,
        ),
    ],
    gradient_clip_val=1.0,
    log_every_n_steps=4,
)

def setup_logger(log_dir: str):
    logger = getLogger(__name__)

    if trainer.global_rank == 0:
        # Set to DEBUG level and add file handler only for rank 0
        logger.setLevel(DEBUG)
        Path(log_dir).mkdir(parents=True, exist_ok=True)
        file_handler = FileHandler(f"{log_dir}/test.log", encoding="utf-8")
        formatter = Formatter(
            "%(asctime)s - %(levelname)s - %(filename)s - %(name)s - %(funcName)s - %(message)s"
        )
        file_handler.setFormatter(formatter)
        logger.addHandler(file_handler)
        # Save config file
        with open(f"{log_dir}/config.yaml", "w") as f:
            yaml.dump(config, f)
    else:
        # Completely suppress log output for ranks other than 0
        logger.setLevel(100)  # 100 is above CRITICAL (50), so all log output is suppressed
    return logger

logger = setup_logger(trainer.logger.log_dir)
logger.info(f"{config_path=}")
logger.info(summary(model, verbose=0, depth=6))

logger.info("Start testing 1")
trainer.test(
    model,
    datamodule=datamodule,
    ckpt_path="lightning_logs/version_xxx/checkpoints/last.ckpt",
)
logger.info("Finish testing 1")

I haven't verified if this code alone can reproduce the error, but this is how my code is structured.

calliope-pro avatar May 19 '25 21:05 calliope-pro

@calliope-pro, unfortunately this code is not runnable. Can you please share a self-contained repro? Also, can you share a stack trace?

@championsnet, can you help with the above?

tjruwase avatar Jun 06 '25 15:06 tjruwase

This happens on deepspeed==0.16.4 when:

  1. You set up a trainer using DeepSpeed stage 3 with offloading, initializing the optimizers and everything.
  2. Perform training and save the checkpoint in sharded form.
  3. Try to load the sharded checkpoint and perform testing with the trainer.

Stack trace:
[rank0]: Traceback (most recent call last):
[rank0]:   File "./scripts/training/classifier/train.py", line 597, in <module>
[rank0]:     train(arg1, arg2)
[rank0]:   File "./scripts/training/classifier/train.py", line 582, in train
[rank0]:     trainer.test(
[rank0]:   File "./.env/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 775, in test
[rank0]:     return call._call_and_handle_interrupt(
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "./env/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 47, in _call_and_handle_interrupt
[rank0]:     return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "./env/lib/python3.11/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
[rank0]:     return function(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "./env/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 817, in _test_impl
[rank0]:     results = self._run(model, ckpt_path=ckpt_path)
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "./env/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 999, in _run
[rank0]:     self._checkpoint_connector._restore_modules_and_callbacks(ckpt_path)
[rank0]:   File "./env/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/checkpoint_connector.py", line 408, in _restore_modules_and_callbacks
[rank0]:     self.resume_start(checkpoint_path)
[rank0]:   File "./env/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/checkpoint_connector.py", line 83, in resume_start
[rank0]:     loaded_checkpoint = self.trainer.strategy.load_checkpoint(checkpoint_path)
[rank0]:                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "./env/lib/python3.11/site-packages/lightning/pytorch/strategies/deepspeed.py", line 670, in load_checkpoint
[rank0]:     _, client_state = self.deepspeed_engine.load_checkpoint(
[rank0]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "./env/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 2939, in load_checkpoint
[rank0]:     self.optimizer._restore_from_bit16_weights()
[rank0]:     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: AttributeError: 'DeepSpeedZeRoOffload' object has no attribute '_restore_from_bit16_weights'

Unfortunately, I cannot share the code, as I moved away from this solution quickly and cannot replicate the exact code that led to the error.
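
In case it helps anyone who hits the same trace: what I would try next (an untested sketch; the paths and MyLightningModule are placeholders) is to avoid engine.load_checkpoint for testing altogether by consolidating the sharded ZeRO checkpoint into a single fp32 file with Lightning's helper and loading only the model weights, so the optimizer-restore path that raises the AttributeError is never reached.

# Untested workaround sketch: consolidate the sharded ZeRO checkpoint into a
# single fp32 Lightning checkpoint and load only the model weights for testing.
from lightning.pytorch.utilities.deepspeed import convert_zero_checkpoint_to_fp32_state_dict

sharded_ckpt = "lightning_logs/version_xxx/checkpoints/last.ckpt"    # directory saved by the DeepSpeed strategy
fp32_ckpt = "lightning_logs/version_xxx/checkpoints/last_fp32.ckpt"  # consolidated output file (illustrative path)

convert_zero_checkpoint_to_fp32_state_dict(sharded_ckpt, fp32_ckpt)

model = MyLightningModule.load_from_checkpoint(fp32_ckpt)  # placeholder LightningModule class
trainer.test(model, datamodule=datamodule)  # no ckpt_path, so engine.load_checkpoint is never called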

championsnet avatar Jun 06 '25 15:06 championsnet

Closing as no longer needed by OP.

tjruwase avatar Jun 14 '25 16:06 tjruwase