Fix compile optimizer when using an LR scheduler

Context

  • [ ] add a new feature
  • [x] fix a bug
  • [ ] update tests and/or documentation
  • [ ] other (please add here)

#2659 added an option to compile the optimizer; however, enabling it together with a non-constant LR scheduler causes a crash when the scheduler is set up.

Crash log:
[rank0]: Traceback (most recent call last):                                                                                                                           
[rank0]:   File "/home/r/soft/torchtune/torchtune/recipes/full_finetune_distributed.py", line 1094, in <module>                                                       
[rank0]:     sys.exit(recipe_main())                                                                                                                                  
[rank0]:              ~~~~~~~~~~~^^                                                                                                                                   
[rank0]:   File "/home/r/soft/torchtune/torchtune/torchtune/config/_parse.py", line 99, in wrapper                                                                    
[rank0]:     sys.exit(recipe_main(conf))                                                                                                                              
[rank0]:              ~~~~~~~~~~~^^^^^^                                                                                                                               
[rank0]:   File "/home/r/soft/torchtune/torchtune/recipes/full_finetune_distributed.py", line 1088, in recipe_main                                                    
[rank0]:     recipe.setup(cfg=cfg)                                                                                                                                    
[rank0]:     ~~~~~~~~~~~~^^^^^^^^^                                                                                                                                    
[rank0]:   File "/home/r/soft/torchtune/torchtune/recipes/full_finetune_distributed.py", line 434, in setup                                                           
[rank0]:     self._lr_scheduler = self._setup_lr_scheduler(                                                                                                           
[rank0]:                          ~~~~~~~~~~~~~~~~~~~~~~~~^                                                                                                           
[rank0]:         cfg_lr_scheduler=cfg.get("lr_scheduler", None),                                                                                                      
[rank0]:         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                      
[rank0]:         num_training_steps=self.total_epochs * self._steps_per_epoch,                                                                                        
[rank0]:         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                        
[rank0]:         last_epoch=self.global_step - 1,                                                                                                                     
[rank0]:         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                     
[rank0]:     )                                                                                                                                                        
[rank0]:     ^  
[rank0]:   File "/home/r/soft/torchtune/torchtune/recipes/full_finetune_distributed.py", line 484, in _setup_lr_scheduler                                             
[rank0]:     lr_scheduler = config.instantiate(                                                                                                                       
[rank0]:         cfg_lr_scheduler,                                                                                                                                    
[rank0]:     ...<2 lines>...                                                                                                                                          
[rank0]:         last_epoch=last_epoch,                                                                                                                               
[rank0]:     )                                                                                                                                                        
[rank0]:   File "/home/r/soft/torchtune/torchtune/torchtune/config/_instantiate.py", line 163, in instantiate                                                         
[rank0]:     return _instantiate_node(                                                                                                                                
[rank0]:         OmegaConf.to_container(config, resolve=True),                                                                                                        
[rank0]:         caller_globals=caller_globals,                                                                                                                       
[rank0]:         *args,                                                                                                                                               
[rank0]:     )                                                                                                                                                        
[rank0]:   File "/home/r/soft/torchtune/torchtune/torchtune/config/_instantiate.py", line 62, in _instantiate_node                                                    
[rank0]:     return _create_component(_component_, args, kwargs)                                                                                                      
[rank0]:   File "/home/r/soft/torchtune/torchtune/torchtune/config/_instantiate.py", line 24, in _create_component                                                    
[rank0]:     return _component_(*args, **kwargs)                                                                                                                      
[rank0]:   File "/home/r/soft/torchtune/torchtune/torchtune/training/lr_schedulers.py", line 58, in get_cosine_schedule_with_warmup                                   
[rank0]:     return LambdaLR(optimizer, lr_lambda, last_epoch)                                                                                                        
[rank0]:   File "/home/r/soft/torchtune/tt/lib/python3.13/site-packages/torch/optim/lr_scheduler.py", line 286, in __init__                                           
[rank0]:     super().__init__(optimizer, last_epoch)                                                                                                                  
[rank0]:     ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                  
[rank0]:   File "/home/r/soft/torchtune/tt/lib/python3.13/site-packages/torch/optim/lr_scheduler.py", line 131, in __init__                                           
[rank0]:     patch_track_step_called(self.optimizer)                                                                                                                  
[rank0]:     ~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^                                                                                                                  
[rank0]:   File "/home/r/soft/torchtune/tt/lib/python3.13/site-packages/torch/optim/lr_scheduler.py", line 129, in patch_track_step_called                            
[rank0]:     opt.step = wrap_step(opt.step)  # type: ignore[method-assign]                                                                                            
[rank0]:                ~~~~~~~~~^^^^^^^^^^                                                                                                                           
[rank0]:   File "/home/r/soft/torchtune/tt/lib/python3.13/site-packages/torch/optim/lr_scheduler.py", line 118, in wrap_step                                          
[rank0]:     func = step_fn.__func__                                                                                                                                  
[rank0]:            ^^^^^^^^^^^^^^^^                                                                                                                                  
[rank0]: AttributeError: 'function' object has no attribute '__func__'. Did you mean: '__doc__'?
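For reference, here is a minimal sketch of the failure mode outside the recipe, under the assumption that the compiled-optimizer option rebinds `optimizer.step` to a `torch.compile`'d callable (which is what the traceback suggests):

```python
import torch

model = torch.nn.Linear(4, 4)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Rebinding step to a compiled callable makes opt.step a plain function
# instead of a bound method.
opt.step = torch.compile(opt.step)

# LambdaLR.__init__ calls patch_track_step_called(opt), which wraps opt.step
# via step_fn.__func__; a plain function has no __func__, so this raises the
# AttributeError shown in the traceback above.
sched = torch.optim.lr_scheduler.LambdaLR(opt, lambda step: 1.0)
```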

Changelog

  • Moved the optimizer compilation step to after the learning rate scheduler setup (see the sketch after this list). This seems to resolve the issue: training launches and completes successfully when both optimizer compilation and an LR scheduler are enabled.

  • Added parameters to the cosine scheduler for setting the minimum LR ratio during the warmup and decay stages. These can be used to work around a PyTorch bug (https://github.com/pytorch/pytorch/issues/126514) that makes the model parameters become NaN when a compiled non-fused Adam/AdamW optimizer performs a step while the learning rate is exactly 0.
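Concretely, the reorder means constructing the scheduler before rebinding `optimizer.step`. In the toy example above, swapping the two lines is enough; in the recipe this corresponds to moving the optimizer-compilation call to after `_setup_lr_scheduler`:

```python
import torch

model = torch.nn.Linear(4, 4)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Build the scheduler first, while opt.step is still a bound method that
# LambdaLR's patch_track_step_called can wrap safely...
sched = torch.optim.lr_scheduler.LambdaLR(opt, lambda step: 1.0)

# ...and only then rebind the compiled step.
opt.step = torch.compile(opt.step)
```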

Test plan

Please make sure to do each of the following if applicable to your PR. If you're unsure about any one of these, just ask and we will happily help. We also have a contributing page for some guidance on contributing.

  • [x] run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
  • [x] add unit tests for any new functionality
  • [x] update docstrings for any new or updated methods or classes
  • [ ] run unit tests via pytest tests
  • [ ] run recipe tests via pytest tests -m integration_test
  • [ ] manually run any new or modified recipes with sufficient proof of correctness
  • [ ] include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)

UX

  • [ ] I did not change any public API
  • [ ] I have added an example to docs or docstrings

intervitens · May 07 '25 00:05

On further testing, it seems like there are still issues:

  • With fused: false, using an LR scheduler together with the compiled optimizer results in a NaN loss after the first step.
  • optimizer_in_bwd is not compatible with the compiled optimizer, neither before nor after the proposed fix; a possible guard is sketched after the error log below.
Error log:
[rank2]: Traceback (most recent call last):                                                                                                                           
[rank2]:   File "/home/r/soft/torchtune/torchtune/recipes/full_finetune_distributed.py", line 1094, in <module>                                                       
[rank2]:     sys.exit(recipe_main())                                                                                                                                  
[rank2]:              ~~~~~~~~~~~^^                                                                                                                                   
[rank2]:   File "/home/r/soft/torchtune/torchtune/torchtune/config/_parse.py", line 99, in wrapper                                                                    
[rank2]:     sys.exit(recipe_main(conf))                                                                                                                              
[rank2]:              ~~~~~~~~~~~^^^^^^                                                                                                                               
[rank2]:   File "/home/r/soft/torchtune/torchtune/recipes/full_finetune_distributed.py", line 1088, in recipe_main
[rank2]:     recipe.setup(cfg=cfg)                                                                                                                                    
[rank2]:     ~~~~~~~~~~~~^^^^^^^^^                                                                                                                                    
[rank2]:   File "/home/r/soft/torchtune/torchtune/recipes/full_finetune_distributed.py", line 354, in setup                                                           
[rank2]:     self._optimizer.step,                                                                                                                                    
[rank2]:     ^^^^^^^^^^^^^^^^^^^^                                                                                                                                     
[rank2]: AttributeError: 'NoneType' object has no attribute 'step'
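Until that combination is supported, one option would be for the recipe to fail fast with a clear message instead of crashing on the missing optimizer object; a minimal sketch (the function and argument names here are illustrative, not torchtune's actual API):

```python
def validate_compile_optimizer_config(
    optimizer_in_bwd: bool, compile_optimizer: bool
) -> None:
    """Fail fast: with optimizer_in_bwd there is no single optimizer object
    whose step() can be compiled (the recipe's self._optimizer is None)."""
    if optimizer_in_bwd and compile_optimizer:
        raise ValueError(
            "optimizer_in_bwd is not compatible with optimizer compilation; "
            "please disable one of the two options."
        )
```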

intervitens · May 07 '25 01:05

Looks like the NaN issue is caused by this PyTorch bug: https://github.com/pytorch/pytorch/issues/126514. It goes away when I modify get_cosine_schedule_with_warmup to make the minimum learning rate a very small number instead of exactly 0.0.
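Based on that description, a standalone repro of the upstream behavior would look roughly like the following; whether it actually produces NaNs depends on the PyTorch version, per the linked issue:

```python
import torch

model = torch.nn.Linear(8, 8)
# lr is exactly 0 and the optimizer is non-fused, matching the failing setup.
opt = torch.optim.AdamW(model.parameters(), lr=0.0, fused=False)

@torch.compile
def compiled_step():
    opt.step()

model(torch.randn(2, 8)).sum().backward()
compiled_step()

# Per pytorch/pytorch#126514, the compiled non-fused step at lr=0 can turn the
# parameters into NaN, while the eager step at lr=0 leaves them unchanged.
print(any(p.isnan().any().item() for p in model.parameters()))
```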

intervitens · May 08 '25 09:05

@intervitens thanks for surfacing these and debugging. Do you have a repro command for the NaN bug? I tried using our LR scheduler with compiled optimizer patched onto this PR and I don't see any NaNs. As for the optimizer-in-backward issue, I wonder whether we can adapt @joecummings's changes from #2712 to the distributed recipe.

ebsmothers · May 09 '25 19:05

This is the config that I can use to reproduce the issue: https://gist.github.com/intervitens/df9eef7fd3ff3eec979b6aa6214ea99c. Running it with tune run --nproc_per_node 4 full_finetune_distributed --config config_llama_1B.yaml results in the loss becoming NaN after the first step. If I edit get_cosine_schedule_with_warmup so that the LR multiplier is a small value instead of 0.0, the issue disappears.

I think the best way to solve this without changing the default behavior of the scheduler would be to add optional parameters for the minimum LR ratio during the warmup and decay stages (sketched below), and then add a warning/error in the recipe when both an LR scheduler and non-fused optimizer compilation are enabled, until the underlying PyTorch issue is solved.
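A sketch of what such a scheduler could look like, with floors on the LR multiplier so it never hits exactly 0. The parameter names warmup_min_lr_ratio / decay_min_lr_ratio are placeholders rather than the PR's actual signature, and the existing logic is abbreviated from torchtune's get_cosine_schedule_with_warmup; the defaults of 0.0 keep today's behavior.

```python
import math

from torch.optim.lr_scheduler import LambdaLR


def get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps: int,
    num_training_steps: int,
    num_cycles: float = 0.5,
    last_epoch: int = -1,
    # Placeholder names: floors that keep the LR multiplier strictly above 0,
    # so a compiled non-fused Adam/AdamW step never runs at lr == 0.
    warmup_min_lr_ratio: float = 0.0,
    decay_min_lr_ratio: float = 0.0,
) -> LambdaLR:
    def lr_lambda(current_step: int) -> float:
        if current_step < num_warmup_steps:
            # Linear warmup, floored at warmup_min_lr_ratio.
            return max(warmup_min_lr_ratio, current_step / max(1, num_warmup_steps))
        progress = (current_step - num_warmup_steps) / max(
            1, num_training_steps - num_warmup_steps
        )
        cosine = 0.5 * (1.0 + math.cos(math.pi * num_cycles * 2.0 * progress))
        # Cosine decay, floored at decay_min_lr_ratio.
        return max(decay_min_lr_ratio, cosine)

    return LambdaLR(optimizer, lr_lambda, last_epoch)
```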

intervitens · May 09 '25 19:05