[BUG] Exception raised at the end of training with deepcompile enabled

Open. eternalNight opened this issue 3 months ago • 1 comment

Describe the bug

With a training script like the following:

import deepspeed
import deepspeed.comm as dist

def main(args):
    deepspeed.init_distributed()
    model = Model()
    ...  # setup and training loop elided in the original report
    model.destroy()
    dist.destroy_process_group()

The following exception is raised at the end of the training process if and only if deepcompile is enabled:

Exception ignored in: <function DeepSpeedEngine.__del__ at 0x7f241b4fe830>
Traceback (most recent call last):
  File "/mnt/engines/deepspeed/deepspeed/runtime/engine.py", line 519, in __del__
    self.destroy()
  File "/mnt/engines/deepspeed/deepspeed/runtime/engine.py", line 523, in destroy
    self.optimizer.destroy()
  File "/mnt/engines/deepspeed/deepspeed/runtime/zero/stage3.py", line 468, in destroy
    self.parameter_offload.destroy()
  File "/mnt/engines/deepspeed/deepspeed/runtime/zero/parameter_offload.py", line 227, in destroy
    self._remove_module_hooks()
  File "/mnt/engines/deepspeed/deepspeed/runtime/zero/parameter_offload.py", line 241, in _remove_module_hooks
    print_rank_0(f'Deleted module hooks: forward = {num_forward_hooks}, backward = {num_backward_hooks}',
  File "/mnt/engines/deepspeed/deepspeed/runtime/zero/partition_parameters.py", line 113, in print_rank_0
    rank = dist.get_rank()
  File "/mnt/engines/deepspeed/deepspeed/comm/comm.py", line 720, in get_rank
    assert cdb is not None and cdb.is_initialized(
AssertionError: DeepSpeed backend not set, please initialize it using init_process_group()
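
The traceback shows a rank-0 print helper asserting because the communication backend is already gone when the finalizer runs. As a rough illustration only, the sketch below shows one way such a helper could degrade quietly instead of asserting. `FakeComm` and `print_rank_0_safe` are hypothetical stand-ins written for this example and are not DeepSpeed APIs.

```python
class FakeComm:
    """Hypothetical stand-in for a communication module (not the DeepSpeed API)."""

    def __init__(self, rank=0):
        self._initialized = True
        self._rank = rank

    def is_initialized(self):
        return self._initialized

    def get_rank(self):
        # Mirrors the failure mode in the traceback: asserting on a dead backend.
        assert self._initialized, "backend not set"
        return self._rank

    def destroy_process_group(self):
        self._initialized = False


def print_rank_0_safe(comm, message, out=print):
    """Emit `message` on rank 0; skip silently if the backend is torn down.

    Returns True if the message was emitted, False otherwise.
    """
    if not comm.is_initialized():
        return False  # too late to ask for a rank, so do not assert
    if comm.get_rank() == 0:
        out(message)
        return True
    return False


comm = FakeComm()
lines = []
print_rank_0_safe(comm, "before teardown", out=lines.append)
comm.destroy_process_group()
print_rank_0_safe(comm, "after teardown", out=lines.append)
```

Skipping the message after teardown is one possible design choice; printing unconditionally on every rank would be another. Either way, a guard like this would only hide the symptom and would not explain why the finalizer runs so late.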

To Reproduce

Steps to reproduce the behavior:

  1. Run https://gist.github.com/eternalNight/3c2cf8c703f1e9e7742d3b7f9e1edae3 with deepspeed --num_gpus=N openvla-like.py -c

Expected behavior

No exception is raised.

eternalNight, Sep 22 '25 07:09

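I'm not sure whether the model is still referenced by some global variable, which would defer its deletion (and the __del__ call seen in the traceback) until after dist.destroy_process_group() has run.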
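
To make that hypothesis concrete, here is a minimal, DeepSpeed-free sketch of how a lingering reference can push a finalizer past backend teardown. `FakeEngine`, `_registry`, and `backend_alive` are hypothetical names invented for this example; whether anything like `_registry` actually exists in the deepcompile code path is exactly what remains unverified.

```python
import gc

events = []            # records the order in which things happen
backend_alive = True   # stand-in for an initialized process group
_registry = []         # hypothetical module-level cache keeping the engine alive


class FakeEngine:
    """Hypothetical stand-in for an engine whose finalizer needs the backend."""

    def __del__(self):
        if backend_alive:
            events.append("finalizer ran while backend was alive")
        else:
            events.append("finalizer ran after backend teardown")


def train():
    global backend_alive
    engine = FakeEngine()
    _registry.append(engine)   # extra reference held outside this function
    del engine                 # local name is gone, but the object survives
    backend_alive = False      # stand-in for dist.destroy_process_group()
    events.append("process group destroyed")


train()
_registry.clear()   # the last reference is dropped only now
gc.collect()        # for interpreters without immediate refcount-based cleanup
print(events)
```

In this sketch the finalizer is recorded after "process group destroyed", which is the same ordering the traceback suggests. Removing the `_registry.append(engine)` line makes the finalizer run before teardown instead.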

Also, I think we should consider cleaning up the duplicated definitions of print_rank_0 and replacing them with calls to the global logger (after extending it with those multi-GPU debugging facilities).
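
As a rough sketch of what a rank-aware logger could look like, the example below uses only the standard `logging` module. It assumes the launcher exports a `RANK` environment variable and reads the rank from there, so it never queries the communication backend. `RankFilter` and `get_rank_logger` are hypothetical names for illustration and do not describe DeepSpeed's existing logger.

```python
import logging
import os


class RankFilter(logging.Filter):
    """Pass records only on the selected ranks (assumes a RANK env variable)."""

    def __init__(self, ranks=(0,)):
        super().__init__()
        self.ranks = set(ranks)

    def filter(self, record):
        rank = int(os.environ.get("RANK", "0"))
        record.rank = rank          # make the rank available to the formatter
        return rank in self.ranks


def get_rank_logger(name="rank_logger_sketch", ranks=(0,)):
    """Return a logger that emits only on `ranks`; handlers are set up once."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter("[rank %(rank)s] %(message)s"))
        handler.addFilter(RankFilter(ranks))
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
        logger.propagate = False
    return logger


log = get_rank_logger()
log.info("Deleted module hooks: forward = %d, backward = %d", 0, 0)
```

Because the filter reads the rank from the environment, a call like this would still work from a finalizer that runs after the process group has been destroyed.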

eternalNight, Sep 22 '25 07:09