
[BUG] nebula checkpoint engine AttributeError: 'str' object has no attribute 'tag'

Open unavailableun opened this issue 1 month ago • 0 comments

Describe the bug The CheckpointEngine.commit(info: CheckpointCommitInfo) interface does not match the call made in DeepSpeedEngine.save_checkpoint, which passes the tag string rather than a CheckpointCommitInfo object.


Line 3527 in runtime/engine.py currently calls self.checkpoint_engine.commit(tag); it should be self.checkpoint_engine.commit(commit_info).


[rank0]:   File "/opt/conda/envs/ptca/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1381, in save_checkpoint
[rank0]:     self.strategy.save_checkpoint(checkpoint, filepath, storage_options=storage_options)
[rank0]:   File "/scratch/azureml/cr/j/85a4996e3fa242ed9f68c4faddc40a52/exe/wd/src/smile_lightning/core/strategies/deepspeed.py", line 103, in save_checkpoint
[rank0]:     self.deepspeed_engine.save_checkpoint(
[rank0]:   File "/opt/conda/envs/ptca/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3527, in save_checkpoint
[rank0]:     self.checkpoint_engine.commit(tag)
[rank0]:   File "/opt/conda/envs/ptca/lib/python3.10/site-packages/deepspeed/runtime/checkpoint_engine/nebula_checkpoint_engine.py", line 101, in commit
[rank0]:     tag = info.tag
[rank0]: AttributeError: 'str' object has no attribute 'tag'
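The mismatch in the traceback can be shown without DeepSpeed installed. The sketch below is only an illustration: CommitInfoStandIn and EngineStandIn are hypothetical stand-ins, not the real DeepSpeed classes, and the stand-in models only the tag attribute that the traceback shows being read. The tag value "global_step100" is likewise made up.

```python
from dataclasses import dataclass


@dataclass
class CommitInfoStandIn:
    """Hypothetical stand-in for CheckpointCommitInfo; only `tag` is modeled."""
    tag: str


class EngineStandIn:
    """Hypothetical stand-in for a checkpoint engine whose commit() reads info.tag."""

    def commit(self, info):
        tag = info.tag  # same attribute access as the failing line in the traceback
        return tag


engine = EngineStandIn()

# Passing the bare tag string, as the caller in the traceback does, fails.
try:
    engine.commit("global_step100")
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

# Passing an object that carries the tag works.
committed_tag = engine.commit(CommitInfoStandIn(tag="global_step100"))

print(error_message)
print(committed_tag)
```

The first call fails for the same reason as the reported error: a str has no tag attribute. The second call succeeds because the argument is an object that exposes tag, which is what the commit(info) signature expects.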


unavailableun, Nov 07 '25 03:11