[BUG]"DeepSpeedZeRoOffload missing '_restore_from_bit16_weights' method when loading checkpoints"
Describe the bug
When trying to load a checkpoint during the testing phase with ZeRO-3 offloading enabled, I encounter an AttributeError: the engine calls _restore_from_bit16_weights() on its optimizer, but that method does not exist on the DeepSpeedZeRoOffload class.
To Reproduce
Steps to reproduce the behavior:
1. Train a model using DeepSpeed ZeRO-3 with parameter offloading.
2. Save a checkpoint using PyTorch Lightning.
3. Try to load the checkpoint for testing with trainer.test().
Required packages:
- deepspeed==0.16.3
- pytorch-lightning==2.4.0
- torch==2.5.1
The error occurs specifically when trying to restore the optimizer state during checkpoint loading.
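For reference, here is a hypothetical minimal sketch of the flow that triggers the error (not my actual training code, and not verified end-to-end): the toy dataset, module, shapes, and batch sizes are placeholders, and the DeepSpeedStrategy flags mirror the ZeRO-3 offload setup described above.

# Hypothetical minimal sketch (placeholders only): ZeRO-3 with CPU offloading,
# one short training run, then trainer.test() on the saved sharded checkpoint.
import torch
from torch.utils.data import DataLoader, Dataset
from deepspeed.ops.adam import DeepSpeedCPUAdam
from pytorch_lightning import LightningModule, Trainer
from pytorch_lightning.strategies import DeepSpeedStrategy


class RandomDataset(Dataset):
    # Toy dataset so the sketch is self-contained
    def __len__(self):
        return 64

    def __getitem__(self, idx):
        return torch.randn(32), torch.randn(2)


class ToyModule(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        x, y = x.to(self.layer.weight.dtype), y.to(self.layer.weight.dtype)
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def test_step(self, batch, batch_idx):
        x, y = batch
        x, y = x.to(self.layer.weight.dtype), y.to(self.layer.weight.dtype)
        self.log("test_loss", torch.nn.functional.mse_loss(self.layer(x), y))

    def configure_optimizers(self):
        # DeepSpeed's CPU Adam, since offload_optimizer is enabled below
        return DeepSpeedCPUAdam(self.parameters())


if __name__ == "__main__":
    trainer = Trainer(
        accelerator="gpu",
        devices=1,
        precision="bf16-mixed",
        max_epochs=1,
        strategy=DeepSpeedStrategy(stage=3, offload_optimizer=True, offload_parameters=True),
    )
    trainer.fit(ToyModule(), DataLoader(RandomDataset(), batch_size=8))
    # Reloading the sharded checkpoint for testing is where the AttributeError appears
    trainer.test(
        ToyModule(),
        dataloaders=DataLoader(RandomDataset(), batch_size=8),
        ckpt_path=trainer.checkpoint_callback.best_model_path,
    )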
Expected behavior
The checkpoint should load successfully for testing, properly restoring both the model weights and the optimizer state.
System info (please complete the following information):
- OS: Ubuntu 22.04
- GPU count and types: 1x A6000
- Python version: 3.11.6
I can confirm and reproduce the bug.
@calliope-pro, @championsnet can you provide a stack trace and repro scripts?
@sfc-gh-truwase
# Imports added for completeness; `summary` is assumed to be torchinfo.summary,
# and the Lightning imports may need to be `lightning.pytorch` depending on the install.
import yaml
from logging import DEBUG, FileHandler, Formatter, getLogger
from pathlib import Path

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint
from pytorch_lightning.strategies import DeepSpeedStrategy
from torchinfo import summary

# `config`, `config_path`, `model`, and `datamodule` are defined elsewhere in my script.

deepspeed_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "stage3_param_persistence_threshold": 1e3,  # Set as small as possible
        "stage3_max_live_parameters": 5e6,  # Significantly smaller than default
    },
    "zero_allow_untested_optimizer": True,
    "zero_force_ds_cpu_optimizer": False,
}
trainer = Trainer(
    strategy=DeepSpeedStrategy(
        config=deepspeed_config,
    ),
    precision="bf16-mixed",
    min_epochs=config["min_epochs"],
    max_epochs=config["max_epochs"],
    callbacks=[
        EarlyStopping(monitor="validation_loss", patience=4, mode="min"),
        ModelCheckpoint(
            filename="best-checkpoint-{epoch:02d}",
            monitor="validation_loss",
            mode="min",
            save_last=True,
        ),
    ],
    gradient_clip_val=1.0,
    log_every_n_steps=4,
)
def setup_logger(log_dir: str):
    logger = getLogger(__name__)
    if trainer.global_rank == 0:
        # Rank 0 only: enable DEBUG level and attach a file handler
        logger.setLevel(DEBUG)
        Path(log_dir).mkdir(parents=True, exist_ok=True)
        file_handler = FileHandler(f"{log_dir}/test.log", encoding="utf-8")
        formatter = Formatter(
            "%(asctime)s - %(levelname)s - %(filename)s - %(name)s - %(funcName)s - %(message)s"
        )
        file_handler.setFormatter(formatter)
        logger.addHandler(file_handler)
        # Save the run config next to the logs
        with open(f"{log_dir}/config.yaml", "w") as f:
            yaml.dump(config, f)
    else:
        # Completely suppress log output for ranks other than 0
        logger.setLevel(100)  # 100 is above CRITICAL (50), so nothing is emitted
    return logger
logger = setup_logger(trainer.logger.log_dir)
logger.info(f"{config_path=}")
logger.info(summary(model, verbose=0, depth=6))
logger.info("Start testing 1")
trainer.test(
    model,
    datamodule=datamodule,
    ckpt_path="lightning_logs/version_xxx/checkpoints/last.ckpt",
)
logger.info("Finish testing 1")
I haven't verified whether this code alone reproduces the error, but this is how my code is structured.
@calliope-pro, unfortunately this code is not runnable. Can you please share a self-contained repro? Also, can you share a stack trace?
@championsnet, can you help with the above?
This happens on deepspeed==0.16.4 when:
- You set up a trainer using DeepSpeed stage 3 with offloading, initializing the optimizers and everything.
- You perform training and save the checkpoint in sharded form.
- You try to load the sharded checkpoint and run testing with the trainer.
[rank0]: Traceback (most recent call last):
[rank0]: File "./scripts/training/classifier/train.py", line 597, in <module>
[rank0]: train(arg1, arg2)
[rank0]: File "./scripts/training/classifier/train.py", line 582, in train
[rank0]: trainer.test(
[rank0]: File "./.env/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 775, in test
[rank0]: return call._call_and_handle_interrupt(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "./env/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 47, in _call_and_handle_interrupt
[rank0]: return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "./env/lib/python3.11/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
[rank0]: return function(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "./env/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 817, in _test_impl
[rank0]: results = self._run(model, ckpt_path=ckpt_path)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "./env/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 999, in _run
[rank0]: self._checkpoint_connector._restore_modules_and_callbacks(ckpt_path)
[rank0]: File "./env/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/checkpoint_connector.py", line 408, in _restore_modules_and_callbacks
[rank0]: self.resume_start(checkpoint_path)
[rank0]: File "./env/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/checkpoint_connector.py", line 83, in resume_start
[rank0]: loaded_checkpoint = self.trainer.strategy.load_checkpoint(checkpoint_path)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "./env/lib/python3.11/site-packages/lightning/pytorch/strategies/deepspeed.py", line 670, in load_checkpoint
[rank0]: _, client_state = self.deepspeed_engine.load_checkpoint(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "./env/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 2939, in load_checkpoint
[rank0]: self.optimizer._restore_from_bit16_weights()
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: AttributeError: 'DeepSpeedZeRoOffload' object has no attribute '_restore_from_bit16_weights'
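For anyone else who hits this, a possible workaround for test-only runs is to consolidate the sharded ZeRO checkpoint into a single fp32 file with Lightning's convert_zero_checkpoint_to_fp32_state_dict utility and test from that file with a plain strategy, so the DeepSpeed engine never tries to restore optimizer state. This is an untested sketch: MyLightningModule, datamodule, and the paths are placeholders, and the pytorch_lightning namespace exposes the same utility if that distribution is installed.

# Untested workaround sketch: consolidate the sharded ZeRO-3 checkpoint into a
# single fp32 Lightning checkpoint and test from that file without DeepSpeed,
# so no optimizer state has to be restored.
# `MyLightningModule`, `datamodule`, and the paths are placeholders.
from lightning.pytorch import Trainer
from lightning.pytorch.utilities.deepspeed import convert_zero_checkpoint_to_fp32_state_dict

sharded_ckpt = "lightning_logs/version_xxx/checkpoints/last.ckpt"  # DeepSpeed saves this as a directory
fp32_ckpt = "lightning_logs/version_xxx/checkpoints/last_fp32.ckpt"

convert_zero_checkpoint_to_fp32_state_dict(sharded_ckpt, fp32_ckpt)

model = MyLightningModule.load_from_checkpoint(fp32_ckpt)
trainer = Trainer(accelerator="gpu", devices=1, precision="bf16-mixed")
trainer.test(model, datamodule=datamodule)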
Unfortunately, I cannot share the code: I moved away from this setup quickly and cannot reconstruct the exact code that led to the error.
Closing as no longer needed by OP.