🐛[BUG]: LBFGS optimizer doesn't work for PINN training
Version
24.01
On which installation method(s) does this occur?
Docker, Pip, Source
Describe the issue
After specifying the optimizer as bfgs in the config file, it overrides max_steps to 0, and training then fails with a KeyError (traceback below).
Minimum reproducible example
#config
defaults:
  - modulus_default
  - arch:
      - fourier
      - modified_fourier
      - fully_connected
      - multiscale_fourier
  - scheduler: tf_exponential_lr
  - optimizer: bfgs
  - loss: sum

training:
  rec_results_freq: 1000
  max_steps: 150000
Relevant log output
[23:53:04] - lbfgs optimizer selected. Setting max_steps to 0
[23:53:05] - [step: 100000] lbfgs optimization in running
Error executing job with overrides: []
Traceback (most recent call last):
File "/mount/data/test/eikonal/eikonal.py", line 313, in run
slv.solve()
File "/usr/local/lib/python3.10/dist-packages/modulus/sym/solver/solver.py", line 173, in solve
self._train_loop(sigterm_handler)
File "/usr/local/lib/python3.10/dist-packages/modulus/sym/trainer.py", line 543, in _train_loop
loss, losses = self._cuda_graph_training_step(step)
File "/usr/local/lib/python3.10/dist-packages/modulus/sym/trainer.py", line 730, in _cuda_graph_training_step
self.apply_gradients()
File "/usr/local/lib/python3.10/dist-packages/modulus/sym/trainer.py", line 185, in bfgs_apply_gradients
self.optimizer.step(self.bfgs_closure_func)
File "/usr/local/lib/python3.10/dist-packages/torch/optim/lr_scheduler.py", line 68, in wrapper
return wrapped(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 379, in wrapper
out = func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/optim/lbfgs.py", line 298, in step
max_iter = group['max_iter']
KeyError: 'max_iter'
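For context, here is a minimal plain-PyTorch sketch of one way this exact KeyError can be reproduced (an assumption about the failure mechanism, not Modulus-Sym code): loading a checkpoint saved by an Adam optimizer replaces the LBFGS param groups, which then no longer contain LBFGS-specific keys such as max_iter.

```python
import torch

# Minimal sketch (assumption, not Modulus-Sym code): loading an Adam
# checkpoint into an LBFGS optimizer drops LBFGS-only hyperparameters
# such as 'max_iter' from the param groups.
params = [torch.nn.Parameter(torch.randn(3))]

adam = torch.optim.Adam(params, lr=1e-3)
ckpt = adam.state_dict()  # param_groups carry Adam keys only

lbfgs = torch.optim.LBFGS(params, max_iter=1000)
lbfgs.load_state_dict(ckpt)  # param_groups replaced; 'max_iter' is gone

def closure():
    lbfgs.zero_grad()
    loss = (params[0] ** 2).sum()
    loss.backward()
    return loss

lbfgs.step(closure)  # raises KeyError: 'max_iter', as in the traceback above
```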
Environment details
No response
This issue is still active and needs fixing.
This is expected behavior of the LBFGS optimizer in Modulus-Sym: internally, Modulus-Sym sets max_steps to zero when LBFGS is selected. If training is started from scratch, this issue should not show up and the training should run successfully. Reference:
[18:49:00] - attempting to restore from: outputs/helmholtz
[18:49:00] - optimizer checkpoint not found
[18:49:00] - model wave_network.0.pth not found
[18:49:00] - lbfgs optimizer selected. Setting max_steps to 0
/usr/local/lib/python3.10/dist-packages/modulus/sym/eq/derivatives.py:120: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
with torch.cuda.amp.autocast(enabled=False):
[18:49:00] - [step: 0] lbfgs optimization in running
[18:49:58] - lbfgs optimization completed after 1000 steps
[18:49:58] - [step: 0] record constraint batch time: 5.987e-02s
[18:50:00] - [step: 0] record validators time: 2.309e+00s
[18:50:01] - [step: 0] saved checkpoint to outputs/helmholtz
[18:50:01] - [step: 0] loss: 1.007e+04
[18:50:01] - [step: 0] reached maximum training steps, finished training!
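The log above reports the entire LBFGS run under step 0. A minimal plain-PyTorch sketch of why that can happen (an assumption, not Modulus-Sym internals): torch.optim.LBFGS performs its whole inner loop, up to max_iter iterations, inside a single call to step(closure), so one outer training step is enough.

```python
import torch

# Sketch: LBFGS runs up to max_iter iterations inside one .step(closure) call,
# which is why the whole optimization above is logged as "[step: 0]".
w = torch.nn.Parameter(torch.tensor([5.0]))
opt = torch.optim.LBFGS([w], max_iter=1000)  # iteration budget for one outer step

def closure():
    opt.zero_grad()
    loss = ((w - 2.0) ** 2).sum()
    loss.backward()
    return loss

opt.step(closure)   # single outer step, up to 1000 LBFGS iterations inside
print(w.item())     # ~2.0 after the one call
```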
However, the above error occurs if you switch the optimizer in the middle of training, for example going from adam to bfgs after a few steps. While this is technically possible, Modulus-Sym does not currently support such workflows. For such cases, it is recommended to check the main Modulus library.
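If the goal is the usual Adam-then-LBFGS refinement for PINNs, a plain-PyTorch sketch of that handover is shown below (a generic illustration, not a Modulus-Sym workflow): the model weights carry over, but a fresh LBFGS optimizer is constructed rather than loading the Adam optimizer state into it, so its param groups keep max_iter.

```python
import torch

# Generic illustration (not a Modulus-Sym API): switch from Adam to LBFGS by
# creating a fresh LBFGS optimizer; only the model weights carry over.
model = torch.nn.Linear(2, 1)
x, y = torch.randn(64, 2), torch.randn(64, 1)
loss_fn = torch.nn.MSELoss()

# Phase 1: a few Adam steps.
adam = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(100):
    adam.zero_grad()
    loss_fn(model(x), y).backward()
    adam.step()

# Phase 2: fresh LBFGS optimizer with its own hyperparameters ('max_iter' kept).
lbfgs = torch.optim.LBFGS(model.parameters(), max_iter=500)

def closure():
    lbfgs.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    return loss

lbfgs.step(closure)
```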