
Attempted unscale_ but _scale is None after resuming training.

Open LittlePea13 opened this issue 3 months ago • 3 comments

Describe the bug: When loading a previous checkpoint to resume training (after a CUDA OOM), the loss scaler calls unscale_ before any scaling has been done in that iteration and crashes.

Describe how to reproduce: Not sure how to reproduce it reliably. My intuition is that it happens as described above: when training is resumed from a checkpoint, unscale_ runs before the scaler has been called on a loss.
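
For reference, the assertion itself comes from plain PyTorch rather than fairseq2: GradScaler (and FSDP's ShardedGradScaler, which subclasses it) allocates its internal _scale tensor lazily on the first scale() call, so any unscale_() before that trips the same assert. A minimal sketch, assuming a CUDA device is available (plain PyTorch, not fairseq2 code):

import torch

# The scaler's _scale tensor is created lazily inside scale(); calling
# unscale_() before any scale() reproduces the assertion in the traceback below.
scaler = torch.amp.GradScaler("cuda")
param = torch.zeros(1, requires_grad=True, device="cuda")
optimizer = torch.optim.SGD([param], lr=0.1)

scaler.unscale_(optimizer)
# AssertionError: Attempted unscale_ but _scale is None. ...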

Describe the expected behavior: I believe the state of the scaler should be saved with the checkpoint and loaded when resuming training.
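
As a sketch of what I mean (plain PyTorch, not fairseq2's actual checkpoint layout; the dict keys and helper names below are made up): both GradScaler and ShardedGradScaler expose state_dict()/load_state_dict(), so the scale value and growth tracker could be persisted next to the trainer and optimizer state and restored on resume.

import torch
from torch.distributed.fsdp.sharded_grad_scaler import ShardedGradScaler

# FSDP counterpart of torch.amp.GradScaler; it inherits the same state_dict API.
scaler = ShardedGradScaler()

def save_checkpoint(path, model, optimizer, scaler, step):
    # Persist the scaler alongside everything else; the "scaler" key is illustrative.
    torch.save(
        {
            "step": step,
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "scaler": scaler.state_dict(),  # scale, growth/backoff factors, growth tracker
        },
        path,
    )

def load_checkpoint(path, model, optimizer, scaler):
    ckpt = torch.load(path)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    # Restores the loss-scale value so scaling resumes where it left off.
    scaler.load_state_dict(ckpt["scaler"])
    return ckpt["step"]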

Environment:
fairseq2 - 0.5.0.dev0 (commit ad0b4900f56aa9f6130d335f8151bb3546c260ae)
PyTorch - 2.4.0
Python - 3.10.14
CUDA - 12.1

I am using FSDP with fp16 training.

Additional Context:

[Rank 122] 2025-09-29 12:49:58,649 INFO fairseq2 - Tokenizer loaded.
[Rank 122] 2025-09-29 12:59:29,465 INFO fairseq2 - float16 loss scale window set to 128.
[Rank 122] 2025-09-29 12:59:29,478 INFO fairseq2 - Running on 128 process(es).
[Rank 122] 2025-09-29 12:59:29,480 INFO fairseq2 - Restoring training from the last checkpoint at step 160000.
[Rank 122] 2025-09-29 12:59:29,480 INFO fairseq2 - Restoring the trainer state.
[Rank 122] 2025-09-29 12:59:29,484 INFO fairseq2 - Trainer state restored.
[Rank 122] 2025-09-29 12:59:29,484 INFO fairseq2 - Restoring the optimizer state.
[Rank 122] 2025-09-29 12:59:29,952 INFO fairseq2 - Optimizer state restored.
[Rank 122] 2025-09-29 12:59:29,952 INFO fairseq2 - Restoring the data reader state.
[Rank 122] 2025-09-29 12:59:29,955 INFO fairseq2 - Data reader state restored.
[Rank 122] 2025-09-29 13:00:23,639 INFO fairseq2 - Training restored. Resuming from step 160000.
[Rank 122] 2025-09-29 13:19:31,155 ERROR fairseq2 - Command failed with an unexpected error. See the logged stack trace for details.
Traceback (most recent call last):
  File "~/fairseq2/src/fairseq2/cli/_main.py", line 32, in main
    exit_code = _run()
  File "~/fairseq2/src/fairseq2/cli/_main.py", line 85, in _run
    return cli.run(context)
  File "~/fairseq2/src/fairseq2/cli/_cli.py", line 123, in run
    return args.command.run(context, args)  # type: ignore[no-any-return]
  File "~/fairseq2/src/fairseq2/cli/_cli.py", line 361, in run
    return self._handler.run(context, self._parser, args)
  File "~/fairseq2/src/fairseq2/cli/commands/recipe.py", line 158, in run
    self._do_run(context, args)
  File "~/fairseq2/src/fairseq2/cli/commands/recipe.py", line 236, in _do_run
    recipe.run()
  File "~/fairseq2/src/fairseq2/recipes/_trainer.py", line 488, in run
    self._do_run()
  File "~/fairseq2/src/fairseq2/recipes/_trainer.py", line 585, in _do_run
    self._run_step()
  File "~/fairseq2/src/fairseq2/recipes/_trainer.py", line 689, in _run_step
    self._loss_scaler.unscale_gradients_()
  File "~/fairseq2/src/fairseq2/optim/_dynamic_loss_scaler.py", line 211, in unscale_gradients_
    self._grad_scaler.unscale_(self._optimizer)
  File ".../lib/python3.10/site-packages/torch/distributed/fsdp/sharded_grad_scaler.py", line 266, in unscale_
    self._check_scale_growth_tracker("unscale_")
  File ".../lib/python3.10/site-packages/torch/amp/grad_scaler.py", line 158, in _check_scale_growth_tracker
    assert self._scale is not None, (
AssertionError: Attempted unscale_ but _scale is None.  This may indicate your script did not use scaler.scale(loss or outputs) earlier in the iteration.

LittlePea13 avatar Oct 01 '25 12:10 LittlePea13

Thanks for the report @LittlePea13. Is there a chance for you to test your job on the stable v0.5 release? I see that the commit you shared is quite old.

cbalioglu avatar Oct 01 '25 12:10 cbalioglu

Not at the moment, but I can try relaunching the same training once we are able to update our codebase to v0.5.

LittlePea13 avatar Oct 01 '25 13:10 LittlePea13

Thank you! In the meantime I will try to reproduce it using v0.5 as well.

cbalioglu avatar Oct 02 '25 10:10 cbalioglu