[BUG] CompiledModuleWrapper causing issues with checkpoints and pipeline module
CompiledModuleWrapper is implemented as a wrapper class around the model. I see a few issues when running unit tests with compile enabled:
- isinstance(self.module, PipelineModule) is used in multiple places in the code and breaks when self.module is the wrapper rather than the original module, e.g. in DeepSpeedEngine._load_checkpoint (see the first sketch after this list).
- The wrapper class inherits from torch.nn.Module and the wrapped module is added as a submodule of the wrapper, so calling state_dict() adds the prefix "wrapped." to all parameter names. A checkpoint saved from a compiled module therefore cannot be loaded into a non-compiled module, or vice versa (see test_save_tensor_clone).
- Because the wrapper class inherits from torch.nn.Module, extra forward hook calls may be introduced if the user registers hooks on all submodules of the engine.
- Attributes added to the wrapper do not propagate to the wrapped module, e.g. self.module.checkpoint_parallel_write_pipeline is set in PipelineEngine.__init__ after the module is wrapped, and is later used in PipelineModule.save_state_dict, where it is undefined.
- Trying to mitigate the above issues by reusing the module instead of wrapping it, and overriding only its forward function, results in infinite recursion, as forward tries to trace itself in a recursive call (see the second sketch after this list).
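The following minimal sketch (not the actual DeepSpeed implementation; Inner and Wrapper are hypothetical stand-ins for PipelineModule and CompiledModuleWrapper) illustrates the isinstance, state_dict prefix, duplicate-hook, and attribute-propagation issues described above:

```python
import torch

class Inner(torch.nn.Module):                 # hypothetical stand-in for PipelineModule
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(2, 2)

    def forward(self, x):
        return self.linear(x)

class Wrapper(torch.nn.Module):               # hypothetical stand-in for CompiledModuleWrapper
    def __init__(self, module):
        super().__init__()
        self.wrapped = module                 # wrapped module becomes a submodule of the wrapper

    def forward(self, *args, **kwargs):
        return self.wrapped(*args, **kwargs)

inner = Inner()
wrapper = Wrapper(inner)

# 1) isinstance checks against the original class no longer match the engine's module
print(isinstance(wrapper, Inner))             # False

# 2) state_dict keys gain a "wrapped." prefix, so checkpoints from the compiled and
#    non-compiled paths are not interchangeable
print(list(inner.state_dict().keys()))        # ['linear.weight', 'linear.bias']
print(list(wrapper.state_dict().keys()))      # ['wrapped.linear.weight', 'wrapped.linear.bias']

# 3) registering a forward hook on every submodule fires one extra time per step,
#    because the wrapper itself also appears in .modules()
calls = []
for m in wrapper.modules():
    m.register_forward_hook(lambda mod, inp, out: calls.append(type(mod).__name__))
wrapper(torch.randn(1, 2))
print(calls)                                  # ['Linear', 'Inner', 'Wrapper']

# 4) attributes set on the wrapper are not visible from the wrapped module
wrapper.checkpoint_parallel_write_pipeline = False
print(hasattr(inner, "checkpoint_parallel_write_pipeline"))   # False
```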
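The last item can be illustrated with a hypothetical in-place patching sketch (again, not the actual DeepSpeed code): once module.forward is rebound, the replacement resolves module.forward to itself rather than to the original method, so the call recurses.

```python
import torch

def attach_compiled_forward(module):
    # Hypothetical illustration of the failure mode, not DeepSpeed code.
    def compiled_forward(*args, **kwargs):
        # BUG: by the time this runs, module.forward has already been rebound to
        # compiled_forward itself, so compiling/calling module.forward re-enters
        # this function instead of the original forward.
        return torch.compile(module.forward)(*args, **kwargs)
    module.forward = compiled_forward

m = torch.nn.Linear(2, 2)
attach_compiled_forward(m)
# m(torch.randn(1, 2))  # would raise RecursionError: maximum recursion depth exceeded
```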
To Reproduce
I saw these issues while working on compile mode in the following tests:
- deepspeed/tests/unit/checkpoint/test_pipeline.py::TestPipelineCheckpoint::test_checkpoint_pipe_engine[1]
- deepspeed/tests/unit/checkpoint/test_zero_optimizer.py::TestSaveTensorClone::test_save_tensor_clone[True-1]

I added the following section to config_dict:
"compile": {
"enabled": True,
"backend": "inductor",
"kwargs": {
"disable": True
}
}
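For context, a rough sketch of how a config_dict containing this section is handed to the engine (the model and optimizer settings here are placeholders, not the actual test fixtures, and this still needs to run under a distributed launcher/environment such as the pytest commands listed under Launcher context below):

```python
import torch
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
    "compile": {
        "enabled": True,
        "backend": "inductor",
        "kwargs": {"disable": True},
    },
}

model = torch.nn.Linear(8, 8)  # placeholder model
engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```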
System info:
- OS: [e.g. Ubuntu 22.04]
- GPU count and types: 4x A100 (single machine) and 4x Gaudi2 (single machine)
- Python 3.10
- Torch 2.2.x
Launcher context:
python3 -u -m pytest -vv -s unit/checkpoint/test_pipeline.py::TestPipelineCheckpoint::test_checkpoint_pipe_engine[1]
python3 -u -m pytest -vv -s unit/checkpoint/test_zero_optimizer.py::TestSaveTensorClone::test_save_tensor_clone[True-1]
Docker context: NVIDIA Docker image nvcr.io/nvidia/pytorch:23.11-py3
FYI @tohtana