pytorch-lightning
`ReduceLROnPlateau` update error while resuming from a checkpoint
🐛 Bug
While resuming from a checkpoint, I get this error: `pytorch_lightning.utilities.exceptions.MisconfigurationException: ReduceLROnPlateau conditioned on metric loss/train which is not available. Available metrics are: ['monitor/epoch_train']. Condition can be set using `monitor` key in lr scheduler dict`.
After some analysis, it seems the epoch at which the checkpoint was saved is rerun with some internal restart flags set, but the steps between the epoch-start and epoch-end callbacks are skipped (the fit loop's `done` property is true on restart). As a consequence, the metrics normally computed during those steps are not available when the LR schedulers are updated, which breaks `ReduceLROnPlateau` in particular.
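The failure mode can be sketched without Lightning at all: at epoch end the plateau scheduler is asked to step with its monitored metric, but on the restarted epoch the step-level metric was never logged, so the monitor lookup fails. A minimal sketch, where `step_plateau_scheduler`, `FakePlateauScheduler`, and the metric dicts are illustrative stand-ins, not Lightning's actual internals:

```python
# Sketch of Lightning's monitor lookup for plateau schedulers.
# All names here are illustrative, not the real internals.

class MisconfigurationException(Exception):
    pass

def step_plateau_scheduler(scheduler_cfg, available_metrics):
    """Step a ReduceLROnPlateau-style scheduler, or raise if its monitor is missing."""
    monitor = scheduler_cfg["monitor"]
    if monitor not in available_metrics:
        raise MisconfigurationException(
            f"ReduceLROnPlateau conditioned on metric {monitor} which is not available. "
            f"Available metrics are: {sorted(available_metrics)}."
        )
    scheduler_cfg["scheduler"].step(available_metrics[monitor])

class FakePlateauScheduler:
    def __init__(self):
        self.last_metric = None
    def step(self, metric):
        self.last_metric = metric

cfg = {"scheduler": FakePlateauScheduler(), "monitor": "loss/train"}

# Normal epoch: the training steps ran and logged the metric -> update succeeds.
step_plateau_scheduler(cfg, {"loss/train": 0.42})
print(cfg["scheduler"].last_metric)  # 0.42

# Restarted epoch: the steps were skipped, only epoch-level metrics exist
# -> the MisconfigurationException reported above.
try:
    step_plateau_scheduler(cfg, {"monitor/epoch_train": 1.0})
except MisconfigurationException as e:
    print(type(e).__name__)  # MisconfigurationException
```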
To Reproduce
Expected behavior
Skip the `ReduceLROnPlateau` update at epoch end by replacing `self.epoch_loop.update_lr_schedulers("epoch", update_plateau_schedulers=True)` with `self.epoch_loop.update_lr_schedulers("epoch", update_plateau_schedulers=not self.restarting)` in `fit_loop.py:308`.
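The proposed change can be illustrated with a small standalone sketch (`FitLoopSketch` and its attributes are hypothetical stand-ins, not Lightning's actual loop code): gate the plateau-scheduler update on the restart flag so it is skipped exactly once, on the restarted epoch whose step metrics were never computed.

```python
# Sketch of the proposed fix: skip plateau-scheduler updates while restarting.
# `FitLoopSketch` is an illustrative stand-in, not Lightning's FitLoop.

class FitLoopSketch:
    def __init__(self):
        self.restarting = False
        self.plateau_updates = 0

    def update_lr_schedulers(self, interval, update_plateau_schedulers):
        if update_plateau_schedulers:
            # In Lightning this would call scheduler.step(monitored_metric).
            self.plateau_updates += 1

    def on_epoch_end(self):
        # Proposed change: only update plateau schedulers when not restarting,
        # since the monitored metric was never computed on the restarted epoch.
        self.update_lr_schedulers("epoch", update_plateau_schedulers=not self.restarting)
        self.restarting = False  # subsequent epochs run normally

loop = FitLoopSketch()
loop.restarting = True
loop.on_epoch_end()   # restarted epoch: update skipped
loop.on_epoch_end()   # normal epoch: update runs
print(loop.plateau_updates)  # 1
```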
Environment
* CUDA:
- GPU:
- NVIDIA A10G
- NVIDIA A10G
- NVIDIA A10G
- NVIDIA A10G
- NVIDIA A10G
- NVIDIA A10G
- NVIDIA A10G
- NVIDIA A10G
- available: True
- version: 11.6
* Lightning:
- pytorch-lightning: 1.7.5
- torch: 1.12.1+cu116
- torch-tb-profiler: 0.4.0
- torchmetrics: 0.9.3
- torchvision: 0.13.1+cu116
* Packages:
- absl-py: 1.2.0
- aiohttp: 3.8.1
- aiosignal: 1.2.0
- albumentations: 1.2.1
- alembic: 1.8.1
- argon2-cffi: 21.3.0
- argon2-cffi-bindings: 21.2.0
- asttokens: 2.0.8
- async-timeout: 4.0.2
- attrs: 22.1.0
- backcall: 0.2.0
- beautifulsoup4: 4.11.1
- black: 22.8.0
- bleach: 5.0.1
- cachetools: 5.2.0
- certifi: 2022.6.15
- cffi: 1.15.1
- cfgv: 3.3.1
- charset-normalizer: 2.1.1
- click: 8.1.3
- cloudpickle: 2.1.0
- coloredlogs: 15.0.1
- commonmark: 0.9.1
- coverage: 6.4.4
- cycler: 0.11.0
- databricks-cli: 0.17.3
- debugpy: 1.6.3
- decorator: 5.1.1
- defusedxml: 0.7.1
- distlib: 0.3.6
- dnspython: 2.2.1
- docker: 5.0.3
- dohq-artifactory: 0.8.1
- email-validator: 1.2.1
- entrypoints: 0.4
- executing: 1.0.0
- fastjsonschema: 2.16.1
- filelock: 3.8.0
- flake8: 5.0.4
- flask: 2.2.2
- flatbuffers: 2.0.7
- fonttools: 4.37.1
- frozenlist: 1.3.1
- fsspec: 2022.8.2
- ghp-import: 2.1.0
- gitdb: 4.0.9
- gitpython: 3.1.27
- google-auth: 2.11.0
- google-auth-oauthlib: 0.4.6
- greenlet: 1.1.3
- griffe: 0.22.0
- grpcio: 1.47.0
- gunicorn: 20.1.0
- humanfriendly: 10.0
- identify: 2.5.3
- idna: 3.3
- imageio: 2.21.2
- importlib-metadata: 4.12.0
- iniconfig: 1.1.1
- ipykernel: 6.15.2
- ipython: 8.4.0
- ipython-genutils: 0.2.0
- ipywidgets: 8.0.1
- isort: 5.10.1
- itsdangerous: 2.1.2
- jedi: 0.18.1
- jinja2: 3.1.2
- joblib: 1.1.0
- jsonschema: 4.15.0
- jupyter-client: 7.3.5
- jupyter-core: 4.11.1
- jupyterlab-pygments: 0.2.2
- jupyterlab-widgets: 3.0.2
- jupytext: 1.14.1
- kiwisolver: 1.4.4
- lxml: 4.9.1
- mako: 1.2.2
- markdown: 3.3.7
- markdown-it-py: 2.1.0
- markupsafe: 2.1.1
- matplotlib: 3.5.3
- matplotlib-inline: 0.1.6
- mccabe: 0.7.0
- mdit-py-plugins: 0.3.0
- mdurl: 0.1.2
- mergedeep: 1.3.4
- mistune: 0.8.4
- mkdocs: 1.3.1
- mkdocs-autorefs: 0.4.1
- mkdocs-jupyter: 0.21.0
- mkdocs-material: 8.4.2
- mkdocs-material-extensions: 1.0.3
- mkdocstrings: 0.19.0
- mkdocstrings-python: 0.7.1
- mlflow: 1.28.0
- mpmath: 1.2.1
- multidict: 6.0.2
- mypy: 0.971
- mypy-extensions: 0.4.3
- nbclient: 0.6.7
- nbconvert: 6.5.3
- nbformat: 5.4.0
- nest-asyncio: 1.5.5
- netron: 6.0.0
- networkx: 2.8.6
- nodeenv: 1.7.0
- notebook: 6.4.12
- numpy: 1.23.2
- oauthlib: 3.2.0
- onnx: 1.12.0
- onnxruntime: 1.12.1
- onnxsim: 0.4.8
- opencv-python: 4.6.0.66
- opencv-python-headless: 4.6.0.66
- packaging: 21.3
- pandas: 1.4.4
- pandocfilters: 1.5.0
- parso: 0.8.3
- pathspec: 0.10.0
- pexpect: 4.8.0
- pickleshare: 0.7.5
- pillow: 9.2.0
- pip: 22.2.2
- platformdirs: 2.5.2
- pluggy: 1.0.0
- pre-commit: 2.20.0
- prometheus-client: 0.14.1
- prometheus-flask-exporter: 0.20.3
- prompt-toolkit: 3.0.30
- protobuf: 3.19.4
- psutil: 5.9.1
- ptyprocess: 0.7.0
- pure-eval: 0.2.2
- py: 1.11.0
- pyartifactory: 1.10.1
- pyasn1: 0.4.8
- pyasn1-modules: 0.2.8
- pycocotools: 2.0.4
- pycodestyle: 2.9.1
- pycparser: 2.21
- pydantic: 1.9.2
- pydeprecate: 0.3.2
- pyflakes: 2.5.0
- pygments: 2.13.0
- pyjwt: 2.4.0
- pymdown-extensions: 9.5
- pymongo: 4.2.0
- pyparsing: 3.0.9
- pyrsistent: 0.18.1
- pytest: 7.1.2
- pytest-cov: 3.0.0
- pytest-env: 0.6.2
- python-dateutil: 2.8.2
- pytorch-lightning: 1.7.5
- pyturbojpeg: 1.6.7
- pytz: 2022.2.1
- pywavelets: 1.3.0
- pyyaml: 6.0
- pyyaml-env-tag: 0.1
- pyzmq: 23.2.1
- qudida: 0.0.4
- querystring-parser: 1.2.4
- requests: 2.28.1
- requests-oauthlib: 1.3.1
- requests-toolbelt: 0.9.1
- rich: 12.5.1
- rsa: 4.9
- scikit-image: 0.19.3
- scikit-learn: 1.1.2
- scipy: 1.9.1
- send2trash: 1.8.0
- setuptools: 58.1.0
- six: 1.16.0
- smmap: 5.0.0
- soupsieve: 2.3.2.post1
- sqlalchemy: 1.4.40
- sqlparse: 0.4.2
- stack-data: 0.5.0
- sympy: 1.11.1
- tabulate: 0.8.10
- tensorboard: 2.10.0
- tensorboard-data-server: 0.6.1
- tensorboard-plugin-wit: 1.8.1
- terminado: 0.15.0
- threadpoolctl: 3.1.0
- tifffile: 2022.8.12
- tinycss2: 1.1.1
- toml: 0.10.2
- tomli: 2.0.1
- torch: 1.12.1+cu116
- torch-tb-profiler: 0.4.0
- torchmetrics: 0.9.3
- torchvision: 0.13.1+cu116
- tornado: 6.2
- tqdm: 4.64.0
- traitlets: 5.3.0
- types-protobuf: 3.20.1
- types-pyyaml: 6.0.11
- types-requests: 2.28.10
- types-setuptools: 65.3.0
- types-six: 1.16.19
- types-urllib3: 1.26.24
- typing-extensions: 4.3.0
- urllib3: 1.26.12
- virtualenv: 20.16.4
- watchdog: 2.1.9
- wcwidth: 0.2.5
- webencodings: 0.5.1
- websocket-client: 1.4.0
- werkzeug: 2.2.2
- wheel: 0.37.1
- widgetsnbextension: 4.0.2
- yacs: 0.1.8
- yarl: 1.8.1
- zipp: 3.8.1
* System:
- OS: Linux
- architecture:
- 64bit
- ELF
- processor: x86_64
- python: 3.10.4
- version: #1 SMP Wed Jun 15 08:55:08 UTC 2022
Additional context
cc @carmocca @justusschock @awaelchli @ninginthecloud @ananthsub @rohitgr7 @akihironitta
> After some analysis, it seems the last epoch at which the model was saved is rerun with some restart internal flags

Do you mean the checkpoint that was restored was a mid-epoch checkpoint?

No, if I am not mistaken, it was an end-of-epoch checkpoint, generated with `ModelCheckpoint` with its interval set to epoch.

Then it should start from a new epoch and rerun that complete epoch with newly generated metrics, which will be used to update the LR schedulers. Can you share a repro script so we can check this?