
🐛[BUG]: LBFGS optimizer doesn't work for PINN training

Open · hasethinvd opened this issue 1 year ago · 1 comment

Version

24.01

On which installation method(s) does this occur?

Docker, Pip, Source

Describe the issue

After setting the optimizer to bfgs in the config file, Modulus overrides max_steps to 0 and the run then crashes with a KeyError: 'max_iter'.

Minimum reproducible example

# config
defaults:
  - modulus_default
  - arch:
      - fourier
      - modified_fourier
      - fully_connected
      - multiscale_fourier
  - scheduler: tf_exponential_lr
  - optimizer: bfgs
  - loss: sum

training:
  rec_results_freq: 1000
  max_steps: 150000
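
The config above is loaded by a standard Modulus-Sym driver script (eikonal.py in the traceback below). A minimal sketch of such a driver, with the actual eikonal geometry, networks, and constraints omitted and the conf/config paths assumed, looks roughly like this:

import modulus.sym
from modulus.sym.hydra import ModulusConfig
from modulus.sym.solver import Solver
from modulus.sym.domain import Domain


@modulus.sym.main(config_path="conf", config_name="config")
def run(cfg: ModulusConfig) -> None:
    domain = Domain()
    # ... build the architecture from cfg.arch and add constraints/validators here ...
    slv = Solver(cfg, domain)
    slv.solve()  # with optimizer=bfgs this is where the run fails


if __name__ == "__main__":
    run()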

Relevant log output

[23:53:04] - lbfgs optimizer selected. Setting max_steps to 0
[23:53:05] - [step:     100000] lbfgs optimization in running
Error executing job with overrides: []
Traceback (most recent call last):
  File "/mount/data/test/eikonal/eikonal.py", line 313, in run
    slv.solve()
  File "/usr/local/lib/python3.10/dist-packages/modulus/sym/solver/solver.py", line 173, in solve
    self._train_loop(sigterm_handler)
  File "/usr/local/lib/python3.10/dist-packages/modulus/sym/trainer.py", line 543, in _train_loop
    loss, losses = self._cuda_graph_training_step(step)
  File "/usr/local/lib/python3.10/dist-packages/modulus/sym/trainer.py", line 730, in _cuda_graph_training_step
    self.apply_gradients()
  File "/usr/local/lib/python3.10/dist-packages/modulus/sym/trainer.py", line 185, in bfgs_apply_gradients
    self.optimizer.step(self.bfgs_closure_func)
  File "/usr/local/lib/python3.10/dist-packages/torch/optim/lr_scheduler.py", line 68, in wrapper
    return wrapped(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 379, in wrapper
    out = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/optim/lbfgs.py", line 298, in step
    max_iter = group['max_iter']
KeyError: 'max_iter'

Environment details

No response

hasethinvd · May 09 '24

This issue is still active and needs fixing.

avidcoder123 · Sep 06 '24

This is expected behavior of the LBFGS optimizer in Modulus-Sym. When lbfgs is selected, Modulus-Sym sets max_steps to zero, because the entire LBFGS optimization runs within a single training step. If training is started from scratch, this issue should not show up and training should run successfully. Reference:

[18:49:00] - attempting to restore from: outputs/helmholtz
[18:49:00] - optimizer checkpoint not found
[18:49:00] - model wave_network.0.pth not found
[18:49:00] - lbfgs optimizer selected. Setting max_steps to 0
/usr/local/lib/python3.10/dist-packages/modulus/sym/eq/derivatives.py:120: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.amp.autocast(enabled=False):
[18:49:00] - [step:          0] lbfgs optimization in running
[18:49:58] - lbfgs optimization completed after 1000 steps
[18:49:58] - [step:          0] record constraint batch time:  5.987e-02s
[18:50:00] - [step:          0] record validators time:  2.309e+00s
[18:50:01] - [step:          0] saved checkpoint to outputs/helmholtz
[18:50:01] - [step:          0] loss:  1.007e+04
[18:50:01] - [step:          0] reached maximum training steps, finished training!

However, the above error occurs if you switch the optimizer in the middle of training, for example going from adam to bfgs after a few steps and resuming from the existing checkpoint. While this is technically possible, Modulus-Sym does not currently support such workflows. For such cases, it's recommended to check the main Modulus library.
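
For anyone curious about the mechanism: when the run resumes, the previously saved optimizer checkpoint (from the adam run) appears to be restored into the freshly created LBFGS optimizer, so its param groups end up holding Adam's hyperparameters and no max_iter, which is exactly the key that LBFGS.step looks up. A minimal plain-PyTorch sketch (outside Modulus-Sym; the model and loss are illustrative) that reproduces the same KeyError:

import torch

# Illustrative model; any parameters will do.
model = torch.nn.Linear(2, 1)

# Stand-in for the optimizer checkpoint written during the earlier adam run.
adam = torch.optim.Adam(model.parameters(), lr=1e-3)
adam_ckpt = adam.state_dict()

# Switching the config to bfgs and resuming loads that checkpoint into LBFGS,
# replacing its param group hyperparameters (including 'max_iter') with Adam's.
lbfgs = torch.optim.LBFGS(model.parameters())
lbfgs.load_state_dict(adam_ckpt)


def closure():
    lbfgs.zero_grad()
    loss = model(torch.randn(4, 2)).pow(2).mean()
    loss.backward()
    return loss


lbfgs.step(closure)  # raises KeyError: 'max_iter', matching the traceback above

So the practical workaround is to start the bfgs run from scratch (as in the reference log above, where no optimizer checkpoint is found), rather than resuming an adam run with the optimizer switched.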

ktangsali · Oct 17 '24