torchft

Using get_optimizer_state_dict inside state_dict causes TorchFT to get stuck

Open btian opened this issue 1 month ago • 0 comments

When I use torch.distributed.checkpoint.state_dict.get_optimizer_state_dict, e.g.,

    from torch.distributed.checkpoint.state_dict import (
        StateDictOptions,
        get_optimizer_state_dict,
    )

    optimizer_state_dict = get_optimizer_state_dict(
        model=self._model,
        optimizers=self._optimizer,
        options=StateDictOptions(
            full_state_dict=True,
            cpu_offload=True,
        ),
    )

instead of optimizer_state_dict = self._optimizer.state_dict(), TorchFT gets stuck in the should_commit() method in manager.py. Why is this happening?
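One plausible cause, offered here only as a hypothesis: with full_state_dict=True, get_optimizer_state_dict gathers sharded optimizer state across ranks, which involves collective communication, so every rank in the process group must enter the call. If TorchFT invokes state_dict() on only a subset of ranks (or at different points in time), the collective never completes and everything downstream, including should_commit(), blocks. The toy model below illustrates that failure mode with a threading.Barrier standing in for the collective; WORLD_SIZE, full_state_dict_toy, and the barrier are illustrative inventions, not TorchFT or PyTorch APIs.

```python
import threading

WORLD_SIZE = 2
# The barrier stands in for a collective op (e.g. an all-gather):
# it only returns once every rank has entered it.
barrier = threading.Barrier(WORLD_SIZE)

def full_state_dict_toy(rank: int, timeout: float = 1.0) -> str:
    """Toy stand-in for a full_state_dict=True call: a collective
    that requires participation from all ranks."""
    try:
        barrier.wait(timeout=timeout)
        return "gathered"
    except threading.BrokenBarrierError:
        return "deadlock"  # timed out waiting for the other rank(s)

# Case 1: all ranks enter the collective -> it completes on every rank.
results = {}
threads = [
    threading.Thread(target=lambda r=r: results.update({r: full_state_dict_toy(r)}))
    for r in range(WORLD_SIZE)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # {0: 'gathered', 1: 'gathered'} (dict order may vary)

# Case 2: only rank 0 enters (as when state_dict() runs on one rank) -> hang,
# modeled here with a timeout so the example terminates.
barrier.reset()
print(full_state_dict_toy(0))  # deadlock
```

By contrast, self._optimizer.state_dict() is a purely local call with no communication, which would explain why it does not hang.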

btian avatar Oct 15 '25 22:10 btian