torchft
Using get_optimizer_state_dict inside state_dict causes TorchFT to get stuck
When I use `torch.distributed.checkpoint.state_dict.get_optimizer_state_dict`, e.g.,

```python
from torch.distributed.checkpoint.state_dict import (
    StateDictOptions,
    get_optimizer_state_dict,
)

optimizer_state_dict = get_optimizer_state_dict(
    model=self._model,
    optimizers=self._optimizer,
    options=StateDictOptions(
        full_state_dict=True,
        cpu_offload=True,
    ),
)
```
instead of `optimizer_state_dict = self._optimizer.state_dict()`, TorchFT gets stuck in the `should_commit()` method of `manager.py`. Why is this happening?
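For context, here is roughly how the `state_dict` callback is wired into TorchFT's `Manager` (a minimal sketch following the torchft README-style setup; the `model`/`optimizer` names and the toy module are illustrative, not from my actual code). The comment in `state_dict()` describes one hypothesis for the hang, not a confirmed diagnosis:

```python
import torch
from torchft import Manager, ProcessGroupGloo

model = torch.nn.Linear(8, 8)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

def load_state_dict(state_dict: dict) -> None:
    # Restore model and optimizer state received from a peer replica.
    model.load_state_dict(state_dict["model"])
    optimizer.load_state_dict(state_dict["optim"])

def state_dict() -> dict:
    # Plain optimizer.state_dict() is local and does not communicate.
    # Hypothesis: get_optimizer_state_dict(..., full_state_dict=True)
    # issues collectives (e.g. all-gathers) on the model's process group,
    # but TorchFT may invoke this callback on only one replica (such as
    # when serving a checkpoint to a recovering peer), so those
    # collectives would block forever waiting on ranks that never join.
    return {"model": model.state_dict(), "optim": optimizer.state_dict()}

# The Manager uses these callbacks to transfer state between replicas;
# running this for real also requires the usual TorchFT environment
# (lighthouse, replica group configuration, etc.).
manager = Manager(
    pg=ProcessGroupGloo(),
    load_state_dict=load_state_dict,
    state_dict=state_dict,
)
```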