pytorch-image-models [BUG] Training job cannot resume if LR scheduler is Plateau.

Open wuye9036 opened this issue 4 years ago • 2 comments

  File "Trainer.py", line 233, in main
    lr_scheduler.step(start_epoch)
  File "/multimedia-nfs/wuye/libs/miniconda3/envs/py38/lib/python3.8/site-packages/timm/scheduler/plateau_lr.py", line 83, in step
    self.lr_scheduler.step(metric, epoch)  # step the base scheduler
  File "/multimedia-nfs/wuye/libs/miniconda3/envs/py38/lib/python3.8/site-packages/torch/optim/lr_scheduler.py", line 624, in step
    current = float(metrics)
TypeError: float() argument must be a string or a number, not 'NoneType'

Trainer.py is just the official training script, renamed.

Sep 16 '21 12:09 wuye9036

@wuye9036 that is a known issue. I don't run into it frequently because I rarely run LR plateau schedules long enough to care much about resume (usually short fine-tune tasks), but it's something that I intend to fix eventually. The overall state of that scheduler isn't saved or restored (I tried to keep the schedulers state-free, but that doesn't work so well for plateau). You can make it resume (without resuming the tracking of last) if you hack the metric to a fixed value that's definitely worse (below or above, depending on metric direction) than any likely value...

Sep 17 '21 22:09 rwightman

By the hack I mean edit line 233 in main to lr_scheduler.step(start_epoch, metric=-100), or the opposite if your metric scale is reversed.

Sep 17 '21 23:09 rwightman
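
The workaround above can be sketched without timm or torch installed. This is only an illustration under stated assumptions: `FakePlateauScheduler` is a hypothetical stand-in that mimics the failing `float(metric)` conversion from the traceback, and `placeholder_metric` and `start_epoch = 10` are names and values made up for the example. It is not the library's implementation, and it does not restore any plateau tracking state.

```python
class FakePlateauScheduler:
    """Hypothetical stand-in: mimics only the failing call path in the traceback."""

    def __init__(self):
        self.last_metric = None

    def step(self, epoch, metric=None):
        # Like the base scheduler in the traceback, convert the metric to float.
        # With metric=None this raises TypeError.
        self.last_metric = float(metric)


def placeholder_metric(mode, magnitude=100.0):
    """Return a value meant to be worse than any likely real metric.

    mode="max": higher is better (e.g. accuracy), so return a very low value.
    mode="min": lower is better (e.g. loss), so return a very high value.
    """
    if mode == "max":
        return -magnitude
    if mode == "min":
        return magnitude
    raise ValueError(f"mode must be 'max' or 'min', got {mode!r}")


start_epoch = 10  # assumed: the epoch restored from the checkpoint
scheduler = FakePlateauScheduler()

# The original resume call passes no metric, so the float() conversion fails.
try:
    scheduler.step(start_epoch)
    failed = False
except TypeError:
    failed = True

# The workaround: pass a fixed, clearly-worse metric on the resume step.
scheduler.step(start_epoch, metric=placeholder_metric("max"))
```

In the real training script the equivalent change is the one-line edit described in the comment above; the helper only makes the choice of sign explicit for metrics where lower is better.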
