pytorch-image-models
[BUG] Training job cannot resume if LR scheduler is Plateau.
  File "Trainer.py", line 233, in main
    lr_scheduler.step(start_epoch)
  File "/multimedia-nfs/wuye/libs/miniconda3/envs/py38/lib/python3.8/site-packages/timm/scheduler/plateau_lr.py", line 83, in step
    self.lr_scheduler.step(metric, epoch)  # step the base scheduler
  File "/multimedia-nfs/wuye/libs/miniconda3/envs/py38/lib/python3.8/site-packages/torch/optim/lr_scheduler.py", line 624, in step
    current = float(metrics)
TypeError: float() argument must be a string or a number, not 'NoneType'
Trainer.py is just the official training script, renamed.
@wuye9036 that is a known issue. I don't run into it often because I rarely run plateau LR schedules long enough to care much about resuming (usually short fine-tune tasks), but it's something I intend to fix eventually. The overall state of that scheduler isn't saved or restored (I tried to keep the schedulers state-free, but that doesn't work so well for plateau). You can make it resume (without resuming the tracking of the last value) if you hack the metric to a fixed value that's definitely worse (below or above, depending on metric direction) than any likely value...
By the hack I mean: edit line 233 in main to lr_scheduler.step(start_epoch, metric=-100), or use the opposite sign if your metric scale is reversed (lower is better).
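
For anyone applying the workaround, here is a minimal sketch of how the placeholder value could be chosen. The helper name `placeholder_metric`, its `mode` argument, and the default magnitude of 100 are assumptions made for illustration and are not part of timm. The 100 only works if real metric values never get that bad, so adjust it to your metric's range.

```python
def placeholder_metric(mode: str, magnitude: float = 100.0) -> float:
    """Return a stand-in metric that is worse than any plausible real value.

    Hypothetical helper, not part of timm. `mode` follows the usual plateau
    convention: "max" when higher is better (e.g. accuracy), "min" when
    lower is better (e.g. loss).
    """
    if mode == "max":
        # Higher is better, so a large negative value is definitely worse.
        return -magnitude
    if mode == "min":
        # Lower is better, so a large positive value is definitely worse.
        return magnitude
    raise ValueError(f"mode must be 'max' or 'min', got {mode!r}")


# The resume call from the traceback (line 233 in main) would then become
# something like this, for a higher-is-better metric:
#     lr_scheduler.step(start_epoch, metric=placeholder_metric("max"))
```

Passing a real number avoids the `float(None)` failure in the traceback, but as noted above, the scheduler still starts without whatever it was tracking before the restart.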