Add ability to set learning rate
We should be able to specify the learning rate the same way Lightning sets it after finding the desired one with `auto_lr_find`. It's simpler to do this right before training rather than at model instantiation, given how complicated the `instantiate_model` function already is (it determines whether we're starting from scratch, finetuning, replacing the head, etc.).
This changes one `TrainConfig` param. Instead of `auto_lr_find`, there is just `lr`, which can be either `auto` or a float (the desired learning rate).
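For illustration, the new param could look roughly like this (a minimal sketch assuming `TrainConfig` is a pydantic model, as elsewhere in zamba; the exact field definition in the PR may differ):

```python
# Hypothetical sketch of the changed TrainConfig param (illustrative only,
# not necessarily identical to the PR's implementation).
from typing import Literal, Union

from pydantic import BaseModel


class TrainConfig(BaseModel):
    # "auto"  -> run Lightning's learning rate finder before training
    # a float -> use that value directly as the learning rate
    lr: Union[float, Literal["auto"]] = "auto"
```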
Needs:
- [ ] testing
- [ ] documentation
Adding a small fix and testing:
- With `lr: auto`, zamba tries to find the best learning rate, as requested. But you have to set `num_workers: 0` to turn off `multiprocessing_context`. After Epoch 0, it fails with:
RuntimeError: Early stopping conditioned on metric `val_macro_f1` which is not available. Pass in or modify your `EarlyStopping` callback to use any of the following: `train_loss`
- With `lr: 0.0005` or `lr: 0.002`, we get confirmation that lr gets set:
2022-09-07 19:01:14.860 | INFO | zamba.models.model_manager:train_model:285 - Setting learning rate to 0.0005.
But partway through Epoch 0 it gets reverted:
Adjusting learning rate of group 0 to 1.0000e-03.
@ejm714 Any thoughts?
> But partway through Epoch 0 it gets reverted:
Can you give more detail on when this happens? Is this at the start of epoch 0 or just randomly in the middle (which doesn't make sense)?
This may be caused by the optimizer, which references `self.lr` (https://github.com/drivendataorg/zamba/blob/master/zamba/pytorch_lightning/utils.py#L252-L270). We may need to set `model.lr = train_config.lr` in addition to (or in lieu of) `model.hparams.lr`. You'll have to experiment.
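For example, something along these lines right before `trainer.fit` (a sketch of the idea against the PTL 1.x API, not the exact zamba code; `data_module` is a placeholder):

```python
# Sketch: make the attribute the optimizer reads (self.lr) agree with the config,
# and keep the saved hyperparameter in sync with it.
if train_config.lr != "auto":
    model.lr = train_config.lr          # configure_optimizers() builds the optimizer from self.lr
    model.hparams.lr = train_config.lr  # keeps logged/checkpointed hparams consistent
else:
    # let Lightning's tuner pick a value, then write it back the same way
    suggestion = trainer.tuner.lr_find(model, datamodule=data_module).suggestion()
    model.lr = model.hparams.lr = suggestion

trainer.fit(model, datamodule=data_module)
```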
> RuntimeError: Early stopping conditioned on metric `val_macro_f1` which is not available.
Do you have validation videos in your test? We've seen this error before.
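For context, `val_macro_f1` only exists if a validation loop runs, so the error is expected when there is no validation data. A rough sketch of the callback involved (the exact arguments are illustrative, though the patience matches the "did not improve in the last 3 records" message in the log below):

```python
# Sketch: early stopping is conditioned on a metric that is only logged during
# validation, so with no validation videos the only metric Lightning ever sees
# is train_loss -- exactly what the RuntimeError above reports.
from pytorch_lightning.callbacks import EarlyStopping

early_stopping = EarlyStopping(
    monitor="val_macro_f1",  # logged from the validation loop only
    mode="max",
    patience=3,
)
```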
Here's an example:
Epoch 0: 84%|████████████████████████████████████████████████████████          | 250/297 [04:32<00:51, 1.09s/it, loss=0.0839, v_num=6]Adjusting learning rate of group 0 to 1.0000e-03.
Epoch 0: 100%|██████████████████████████████████████████████████████████████████| 297/297 [05:12<00:00, 1.05s/it, loss=0.0799, v_num=6]Metric val_macro_f1 improved. New best score: 0.111
Epoch 1: 84%|████████████████████████████████████████████████████████          | 250/297 [09:46<01:50, 2.34s/it, loss=0.057, v_num=6]Adjusting learning rate of group 0 to 1.0000e-03.
Epoch 1: 100%|██████████████████████████████████████████████████████████████████| 297/297 [10:24<00:00, 2.10s/it, loss=0.0542, v_num=6]Metric val_macro_f1 improved by 0.075 >= min_delta = 0.0. New best score: 0.186
Epoch 2: 84%|████████████████████████████████████████████████████████          | 250/297 [15:03<02:49, 3.61s/it, loss=0.0571, v_num=6]Adjusting learning rate of group 0 to 1.0000e-03.
Epoch 2: 100%|██████████████████████████████████████████████████████████████████| 297/297 [15:42<00:00, 3.17s/it, loss=0.0609, v_num=6]Metric val_macro_f1 improved by 0.030 >= min_delta = 0.0. New best score: 0.216
Epoch 2: 100%|██████████████████████████████████████████████████████████████████| 297/297 [15:42<00:00, 3.17s/it, loss=0.0609, v_num=6]Current lr: 0.001, Backbone lr: 1e-05
Epoch 3: 84%|████████████████████████████████████████████████████████          | 250/297 [21:21<04:00, 5.13s/it, loss=0.0493, v_num=6]Adjusting learning rate of group 0 to 1.0000e-03.
Adjusting learning rate of group 1 to 1.0000e-05.
Epoch 3: 100%|██████████████████████████████████████████████████████████████████| 297/297 [22:01<00:00, 4.45s/it, loss=0.048, v_num=6]Current lr: 0.001, Backbone lr: 1e-05
Epoch 4: 84%|████████████████████████████████████████████████████████          | 250/297 [27:29<05:10, 6.60s/it, loss=0.0424, v_num=6]Adjusting learning rate of group 0 to 1.0000e-03.
Adjusting learning rate of group 1 to 1.0000e-05.
Epoch 4: 100%|██████████████████████████████████████████████████████████████████| 297/297 [28:06<00:00, 5.68s/it, loss=0.0426, v_num=6]Current lr: 0.001, Backbone lr: 1e-05
Epoch 5: 84%|████████████████████████████████████████████████████████          | 250/297 [33:39<06:19, 8.08s/it, loss=0.0268, v_num=6]Adjusting learning rate of group 0 to 1.0000e-03.
Adjusting learning rate of group 1 to 1.0000e-05.
Epoch 5: 100%|██████████████████████████████████████████████████████████████████| 297/297 [34:18<00:00, 6.93s/it, loss=0.0298, v_num=6]Monitored metric val_macro_f1 did not improve in the last 3 records. Best score: 0.216. Signaling Trainer to stop.
Epoch 5: 100%|██████████████████████████████████████████████████████████████████| 297/297 [34:18<00:00, 6.93s/it, loss=0.0298, v_num=6]
2022-09-08 13:40:52.113 | INFO | zamba.models.model_manager:train_model:321 - Calculating metrics on validation set.
I'll look into setting `model.lr`.
Yes, setting `model.lr` seems to do the trick. It still adjusts the lr during most epochs, but it seems to always be either `model.lr` or `model.lr / 100`:
Epoch 0: 84%|████████████████████████████████████████████████████████          | 250/297 [04:20<00:48, 1.04s/it, loss=0.099, v_num=7]Adjusting learning rate of group 0 to 2.0000e-03.
Epoch 0: 100%|██████████████████████████████████████████████████████████████████| 297/297 [04:57<00:00, 1.00s/it, loss=0.1, v_num=7]Metric val_macro_f1 improved. New best score: 0.147
Epoch 1: 84%|████████████████████████████████████████████████████████          | 250/297 [09:28<01:46, 2.27s/it, loss=0.092, v_num=7]Adjusting learning rate of group 0 to 2.0000e-03.
Epoch 1: 100%|██████████████████████████████████████████████████████████████████| 297/297 [10:03<00:00, 2.03s/it, loss=0.103, v_num=7]Metric val_macro_f1 improved by 0.040 >= min_delta = 0.0. New best score: 0.187
Epoch 2: 84%|████████████████████████████████████████████████████████          | 250/297 [14:36<02:44, 3.51s/it, loss=0.0534, v_num=7]Adjusting learning rate of group 0 to 2.0000e-03.
Epoch 2: 100%|██████████████████████████████████████████████████████████████████| 297/297 [15:13<00:00, 3.08s/it, loss=0.0485, v_num=7]Metric val_macro_f1 improved by 0.001 >= min_delta = 0.0. New best score: 0.188
Epoch 2: 100%|██████████████████████████████████████████████████████████████████| 297/297 [15:13<00:00, 3.08s/it, loss=0.0485, v_num=7]Current lr: 0.002, Backbone lr: 2e-05
Epoch 3: 84%|████████████████████████████████████████████████████████          | 250/297 [20:43<03:53, 4.97s/it, loss=0.0688, v_num=7]Adjusting learning rate of group 0 to 2.0000e-03.
Adjusting learning rate of group 1 to 2.0000e-05.
Epoch 3: 100%|██████████████████████████████████████████████████████████████████| 297/297 [21:22<00:00, 4.32s/it, loss=0.0649, v_num=7]Current lr: 0.002, Backbone lr: 2e-05
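One plausible reading of these logs (an assumption, not a statement about the zamba source): the optimizer ends up with two parameter groups, a head group at `model.lr` and a backbone group that is unfrozen a few epochs in at `model.lr / 100`, and the per-epoch "Adjusting learning rate of group ..." messages simply re-apply those two values. Something like PTL's `BackboneFinetuning` callback would produce that pattern:

```python
# Sketch (values read off the logs above, not taken from the zamba source):
# a finetuning callback that unfreezes the backbone at epoch 3 and adds it as a
# second optimizer param group at 1/100 of the current learning rate.
from pytorch_lightning.callbacks import BackboneFinetuning

backbone_finetuning = BackboneFinetuning(
    unfreeze_backbone_at_epoch=3,    # "group 1" messages first appear at epoch 3
    backbone_initial_ratio_lr=0.01,  # backbone lr = current lr * 0.01 -> 2e-05 when lr = 0.002
    verbose=True,                    # logs learning rate changes
)
```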
Codecov Report
Merging #223 (796681c) into master (570f9cc) will increase coverage by 0.0%. The diff coverage is 100.0%.
@@ Coverage Diff @@
## master #223 +/- ##
======================================
Coverage 87.0% 87.0%
======================================
Files 29 29
Lines 1937 1940 +3
======================================
+ Hits 1686 1689 +3
Misses 251 251
| Impacted Files | Coverage Δ | |
|---|---|---|
| zamba/models/config.py | 96.7% <100.0%> (ø) | |
| zamba/models/model_manager.py | 84.3% <100.0%> (+0.2%) | :arrow_up: |
In these runs, the model reports that it is adjusting the learning rate, but the value it chooses is always the initial value that was provided; we're not seeing the legitimate-seeming adjustment we saw with `auto_lr_find: true`.
To see whether the changes we just made might be the cause of this behavior, I reverted them and ran the model again. The behavior is the same: it adjusts the learning rate near the end of each epoch, but always chooses the initial value.
We need to wait until PTL fixes their bug before it's worth doing more here: https://github.com/Lightning-AI/lightning/issues/14674