LR scheduler can result in a division by 0

Open carmocca opened this issue 1 year ago • 2 comments

If --train.max_steps is equal to --train.lr_warmup_steps, the scheduler's T_max becomes 0, and stepping the scheduler raises a division by zero: https://github.com/Lightning-AI/litgpt/blob/6fd737d3da240a67f4acb7a3ce733fa2e67538a4/litgpt/finetune/lora.py#L385

[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/carlos/nightly-env/bin/litgpt", line 8, in <module>
[rank0]:     sys.exit(main())
[rank0]:   File "/home/carlos/lit-parrot/litgpt/__main__.py", line 143, in main
[rank0]:     fn(**kwargs)
[rank0]:   File "/home/carlos/lit-parrot/litgpt/finetune/lora.py", line 143, in setup
[rank0]:     fabric.launch(main, devices, seed, config, data, checkpoint_dir, out_dir, train, eval)
[rank0]:   File "/home/carlos/lightning/src/lightning/fabric/fabric.py", line 866, in launch
[rank0]:     return self._wrap_and_launch(function, self, *args, **kwargs)
[rank0]:   File "/home/carlos/lightning/src/lightning/fabric/fabric.py", line 951, in _wrap_and_launch
[rank0]:     return launcher.launch(to_run, *args, **kwargs)
[rank0]:   File "/home/carlos/lightning/src/lightning/fabric/strategies/launchers/subprocess_script.py", line 107, in launch
[rank0]:     return function(*args, **kwargs)
[rank0]:   File "/home/carlos/lightning/src/lightning/fabric/fabric.py", line 957, in _wrap_with_setup
[rank0]:     return to_run(*args, **kwargs)
[rank0]:   File "/home/carlos/lit-parrot/litgpt/finetune/lora.py", line 196, in main
[rank0]:     fit(
[rank0]:   File "/home/carlos/lit-parrot/litgpt/finetune/lora.py", line 291, in fit
[rank0]:     scheduler.step()
[rank0]:   File "/home/carlos/nightly-env/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 838, in step
[rank0]:     scheduler.step(0)
[rank0]:   File "/home/carlos/nightly-env/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 187, in step
[rank0]:     values = self._get_closed_form_lr()
[rank0]:   File "/home/carlos/nightly-env/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 1029, in _get_closed_form_lr
[rank0]:     return [
[rank0]:   File "/home/carlos/nightly-env/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 1032, in <listcomp>
[rank0]:     * (1 + math.cos(math.pi * self.last_epoch / self.T_max))
[rank0]: ZeroDivisionError: float division by zero
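
The failing expression can be reproduced without torch. The sketch below assumes that T_max is computed as max_steps - lr_warmup_steps (the linked line is not quoted here, so treat that as an assumption) and uses a hypothetical helper, cosine_closed_form_lr, that mirrors the closed-form cosine annealing expression visible in the traceback. The step counts and learning rate are made up for illustration.

```python
import math


def cosine_closed_form_lr(base_lr, eta_min, last_epoch, t_max):
    # Hypothetical stand-in for the closed-form cosine annealing formula;
    # the division by t_max is the one that fails in the traceback.
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * last_epoch / t_max)) / 2


max_steps = 100        # illustrative value for --train.max_steps
lr_warmup_steps = 100  # illustrative value for --train.lr_warmup_steps
t_max = max_steps - lr_warmup_steps  # assumption: how T_max is derived

try:
    cosine_closed_form_lr(3e-4, 0.0, 0, t_max)
    raised = False
except ZeroDivisionError as exc:
    raised = True
    print(f"ZeroDivisionError: {exc}")
```

With any lr_warmup_steps smaller than max_steps, t_max is positive and the same call returns a learning rate instead of raising.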

LitGPT should validate the arguments so that this doesn't happen.
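
One possible shape for such a check, as a minimal sketch rather than the actual fix: validate_lr_schedule is a hypothetical helper, not an existing LitGPT function, and it again assumes T_max is max_steps - lr_warmup_steps. It fails early with a readable message instead of letting the scheduler hit the division by zero mid-training.

```python
def validate_lr_schedule(max_steps: int, lr_warmup_steps: int) -> int:
    """Hypothetical check: return the cosine phase length, or raise if it is empty."""
    # Assumption: the cosine phase covers the steps that remain after warmup.
    t_max = max_steps - lr_warmup_steps
    if t_max <= 0:
        raise ValueError(
            f"--train.lr_warmup_steps ({lr_warmup_steps}) must be smaller than "
            f"--train.max_steps ({max_steps}); otherwise T_max would be {t_max}."
        )
    return t_max


print(validate_lr_schedule(max_steps=100, lr_warmup_steps=10))
```

Whether to raise, warn, or clamp T_max to at least 1 is a design choice; raising keeps a misconfiguration from being silently masked.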

carmocca, May 06 '24 16:05

Still occurring

MaxGonzalezSaez-Diez, Jul 08 '24 10:07

Thanks for the note!

rasbt, Jul 10 '24 21:07