
`ReduceLROnPlateau` update error while resuming from a checkpoint

Open geoffrey-g-delhomme opened this issue 2 years ago • 3 comments

🐛 Bug

While resuming from a checkpoint, I get this error:

`pytorch_lightning.utilities.exceptions.MisconfigurationException: ReduceLROnPlateau conditioned on metric loss/train which is not available. Available metrics are: ['monitor/epoch_train']. Condition can be set using monitor key in lr scheduler dict.`

After some analysis, it seems the last epoch at which the model was saved is rerun with some restart internal flags, but the steps between the epoch_start and epoch_end callbacks are skipped (fit_loop's `done` is true at restart). As a consequence, metrics computed in the steps are not available when the LR schedulers are updated, which breaks ReduceLROnPlateau in particular.
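
To make the suspected mechanism concrete, here is a minimal pure-Python sketch of it. This is a toy model, not Lightning's code: `ToyMisconfigurationError`, `update_plateau_scheduler` and `run_epoch` are hypothetical names, and the only things taken from the report are the metric names and the observation that the steps are skipped on restart.

```python
class ToyMisconfigurationError(Exception):
    """Hypothetical stand-in for Lightning's MisconfigurationException."""


def update_plateau_scheduler(monitor, available_metrics):
    """Toy check: the plateau scheduler needs its monitored metric to exist."""
    if monitor not in available_metrics:
        raise ToyMisconfigurationError(
            f"ReduceLROnPlateau conditioned on metric {monitor} "
            f"which is not available. "
            f"Available metrics are: {sorted(available_metrics)}."
        )
    return available_metrics[monitor]


def run_epoch(restarting, monitor="loss/train"):
    """Toy epoch: epoch start, (optional) steps, then the scheduler update."""
    metrics = {}
    # Epoch-level hook: logs a metric that does not depend on the steps.
    metrics["monitor/epoch_train"] = 1.0
    if not restarting:
        # Training steps: this is where the monitored loss gets logged.
        # Assumption from the report: these are skipped on restart.
        metrics["loss/train"] = 0.42
    # Epoch end: the plateau scheduler is updated from the logged metrics.
    return update_plateau_scheduler(monitor, metrics)
```

In this toy model, `run_epoch(restarting=False)` returns the monitored value, while `run_epoch(restarting=True)` raises because only `monitor/epoch_train` was logged, which mirrors the message above.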

To Reproduce

Expected behavior

Skip the ReduceLROnPlateau update in epoch_end when restarting, by replacing `self.epoch_loop.update_lr_schedulers("epoch", update_plateau_schedulers=True)` with `self.epoch_loop.update_lr_schedulers("epoch", update_plateau_schedulers=not self.restarting)` in fit_loop.py:308.
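
The effect of the proposed change can be sketched with a small stand-in for the two loops. Everything here is an assumption for illustration: `ToyEpochLoop`, `ToyFitLoop` and `end_of_epoch` are hypothetical, only the `update_lr_schedulers(..., update_plateau_schedulers=...)` call shape and the `restarting` flag come from the report, and the toy models a single plateau scheduler without reproducing how Lightning treats other schedulers.

```python
class ToyEpochLoop:
    """Hypothetical stand-in holding the logged metrics and one plateau scheduler."""

    def __init__(self, monitor, metrics):
        self.monitor = monitor
        self.metrics = metrics
        self.plateau_updates = []  # values the plateau scheduler has been fed

    def update_lr_schedulers(self, interval, update_plateau_schedulers):
        # Toy behaviour: only the plateau scheduler is modelled.
        if interval != "epoch" or not update_plateau_schedulers:
            return
        if self.monitor not in self.metrics:
            raise KeyError(f"metric {self.monitor!r} is not available")
        self.plateau_updates.append(self.metrics[self.monitor])


class ToyFitLoop:
    """Hypothetical stand-in for the loop that owns the end-of-epoch update."""

    def __init__(self, epoch_loop, restarting):
        self.epoch_loop = epoch_loop
        self.restarting = restarting

    def end_of_epoch(self, use_proposed_fix):
        if use_proposed_fix:
            # Proposed: skip the plateau update while restarting.
            flag = not self.restarting
        else:
            # Current: always update the plateau scheduler.
            flag = True
        self.epoch_loop.update_lr_schedulers(
            "epoch", update_plateau_schedulers=flag
        )
```

With a restarting loop whose metrics lack `loss/train`, `end_of_epoch(use_proposed_fix=False)` raises in this toy, while `end_of_epoch(use_proposed_fix=True)` returns without touching the scheduler. A loop that is not restarting behaves the same either way.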

Environment

* CUDA:
        - GPU:
                - NVIDIA A10G
                - NVIDIA A10G
                - NVIDIA A10G
                - NVIDIA A10G
                - NVIDIA A10G
                - NVIDIA A10G
                - NVIDIA A10G
                - NVIDIA A10G
        - available:         True
        - version:           11.6
* Lightning:
        - pytorch-lightning: 1.7.5
        - torch:             1.12.1+cu116
        - torch-tb-profiler: 0.4.0
        - torchmetrics:      0.9.3
        - torchvision:       0.13.1+cu116
* Packages:
        - absl-py:           1.2.0
        - aiohttp:           3.8.1
        - aiosignal:         1.2.0
        - albumentations:    1.2.1
        - alembic:           1.8.1
        - argon2-cffi:       21.3.0
        - argon2-cffi-bindings: 21.2.0
        - asttokens:         2.0.8
        - async-timeout:     4.0.2
        - attrs:             22.1.0
        - backcall:          0.2.0
        - beautifulsoup4:    4.11.1
        - black:             22.8.0
        - bleach:            5.0.1
        - cachetools:        5.2.0
        - certifi:           2022.6.15
        - cffi:              1.15.1
        - cfgv:              3.3.1
        - charset-normalizer: 2.1.1
        - click:             8.1.3
        - cloudpickle:       2.1.0
        - coloredlogs:       15.0.1
        - commonmark:        0.9.1
        - coverage:          6.4.4
        - cycler:            0.11.0
        - databricks-cli:    0.17.3
        - debugpy:           1.6.3
        - decorator:         5.1.1
        - defusedxml:        0.7.1
        - distlib:           0.3.6
        - dnspython:         2.2.1
        - docker:            5.0.3
        - dohq-artifactory:  0.8.1
        - email-validator:   1.2.1
        - entrypoints:       0.4
        - executing:         1.0.0
        - fastjsonschema:    2.16.1
        - filelock:          3.8.0
        - flake8:            5.0.4
        - flask:             2.2.2
        - flatbuffers:       2.0.7
        - fonttools:         4.37.1
        - frozenlist:        1.3.1
        - fsspec:            2022.8.2
        - ghp-import:        2.1.0
        - gitdb:             4.0.9
        - gitpython:         3.1.27
        - google-auth:       2.11.0
        - google-auth-oauthlib: 0.4.6
        - greenlet:          1.1.3
        - griffe:            0.22.0
        - grpcio:            1.47.0
        - gunicorn:          20.1.0
        - humanfriendly:     10.0
        - identify:          2.5.3
        - idna:              3.3
        - imageio:           2.21.2
        - importlib-metadata: 4.12.0
        - iniconfig:         1.1.1
        - ipykernel:         6.15.2
        - ipython:           8.4.0
        - ipython-genutils:  0.2.0
        - ipywidgets:        8.0.1
        - isort:             5.10.1
        - itsdangerous:      2.1.2
        - jedi:              0.18.1
        - jinja2:            3.1.2
        - joblib:            1.1.0
        - jsonschema:        4.15.0
        - jupyter-client:    7.3.5
        - jupyter-core:      4.11.1
        - jupyterlab-pygments: 0.2.2
        - jupyterlab-widgets: 3.0.2
        - jupytext:          1.14.1
        - kiwisolver:        1.4.4
        - lxml:              4.9.1
        - mako:              1.2.2
        - markdown:          3.3.7
        - markdown-it-py:    2.1.0
        - markupsafe:        2.1.1
        - matplotlib:        3.5.3
        - matplotlib-inline: 0.1.6
        - mccabe:            0.7.0
        - mdit-py-plugins:   0.3.0
        - mdurl:             0.1.2
        - mergedeep:         1.3.4
        - mistune:           0.8.4
        - mkdocs:            1.3.1
        - mkdocs-autorefs:   0.4.1
        - mkdocs-jupyter:    0.21.0
        - mkdocs-material:   8.4.2
        - mkdocs-material-extensions: 1.0.3
        - mkdocstrings:      0.19.0
        - mkdocstrings-python: 0.7.1
        - mlflow:            1.28.0
        - mpmath:            1.2.1
        - multidict:         6.0.2
        - mypy:              0.971
        - mypy-extensions:   0.4.3
        - nbclient:          0.6.7
        - nbconvert:         6.5.3
        - nbformat:          5.4.0
        - nest-asyncio:      1.5.5
        - netron:            6.0.0
        - networkx:          2.8.6
        - nodeenv:           1.7.0
        - notebook:          6.4.12
        - numpy:             1.23.2
        - oauthlib:          3.2.0
        - onnx:              1.12.0
        - onnxruntime:       1.12.1
        - onnxsim:           0.4.8
        - opencv-python:     4.6.0.66
        - opencv-python-headless: 4.6.0.66
        - packaging:         21.3
        - pandas:            1.4.4
        - pandocfilters:     1.5.0
        - parso:             0.8.3
        - pathspec:          0.10.0
        - pexpect:           4.8.0
        - pickleshare:       0.7.5
        - pillow:            9.2.0
        - pip:               22.2.2
        - platformdirs:      2.5.2
        - pluggy:            1.0.0
        - pre-commit:        2.20.0
        - prometheus-client: 0.14.1
        - prometheus-flask-exporter: 0.20.3
        - prompt-toolkit:    3.0.30
        - protobuf:          3.19.4
        - psutil:            5.9.1
        - ptyprocess:        0.7.0
        - pure-eval:         0.2.2
        - py:                1.11.0
        - pyartifactory:     1.10.1
        - pyasn1:            0.4.8
        - pyasn1-modules:    0.2.8
        - pycocotools:       2.0.4
        - pycodestyle:       2.9.1
        - pycparser:         2.21
        - pydantic:          1.9.2
        - pydeprecate:       0.3.2
        - pyflakes:          2.5.0
        - pygments:          2.13.0
        - pyjwt:             2.4.0
        - pymdown-extensions: 9.5
        - pymongo:           4.2.0
        - pyparsing:         3.0.9
        - pyrsistent:        0.18.1
        - pytest:            7.1.2
        - pytest-cov:        3.0.0
        - pytest-env:        0.6.2
        - python-dateutil:   2.8.2
        - pytorch-lightning: 1.7.5
        - pyturbojpeg:       1.6.7
        - pytz:              2022.2.1
        - pywavelets:        1.3.0
        - pyyaml:            6.0
        - pyyaml-env-tag:    0.1
        - pyzmq:             23.2.1
        - qudida:            0.0.4
        - querystring-parser: 1.2.4
        - requests:          2.28.1
        - requests-oauthlib: 1.3.1
        - requests-toolbelt: 0.9.1
        - rich:              12.5.1
        - rsa:               4.9
        - scikit-image:      0.19.3
        - scikit-learn:      1.1.2
        - scipy:             1.9.1
        - send2trash:        1.8.0
        - setuptools:        58.1.0
        - six:               1.16.0
        - smmap:             5.0.0
        - soupsieve:         2.3.2.post1
        - sqlalchemy:        1.4.40
        - sqlparse:          0.4.2
        - stack-data:        0.5.0
        - sympy:             1.11.1
        - tabulate:          0.8.10
        - tensorboard:       2.10.0
        - tensorboard-data-server: 0.6.1
        - tensorboard-plugin-wit: 1.8.1
        - terminado:         0.15.0
        - threadpoolctl:     3.1.0
        - tifffile:          2022.8.12
        - tinycss2:          1.1.1
        - toml:              0.10.2
        - tomli:             2.0.1
        - torch:             1.12.1+cu116
        - torch-tb-profiler: 0.4.0
        - torchmetrics:      0.9.3
        - torchvision:       0.13.1+cu116
        - tornado:           6.2
        - tqdm:              4.64.0
        - traitlets:         5.3.0
        - types-protobuf:    3.20.1
        - types-pyyaml:      6.0.11
        - types-requests:    2.28.10
        - types-setuptools:  65.3.0
        - types-six:         1.16.19
        - types-urllib3:     1.26.24
        - typing-extensions: 4.3.0
        - urllib3:           1.26.12
        - virtualenv:        20.16.4
        - watchdog:          2.1.9
        - wcwidth:           0.2.5
        - webencodings:      0.5.1
        - websocket-client:  1.4.0
        - werkzeug:          2.2.2
        - wheel:             0.37.1
        - widgetsnbextension: 4.0.2
        - yacs:              0.1.8
        - yarl:              1.8.1
        - zipp:              3.8.1
* System:
        - OS:                Linux
        - architecture:
                - 64bit
                - ELF
        - processor:         x86_64
        - python:            3.10.4
        - version:           #1 SMP Wed Jun 15 08:55:08 UTC 2022

Additional context

cc @carmocca @justusschock @awaelchli @ninginthecloud @ananthsub @rohitgr7 @akihironitta

geoffrey-g-delhomme, Sep 14 '22 07:09

> After some analysis, it seems the last epoch at which the model was saved is rerun with some restart internal flags

Do you mean that the restored checkpoint was a mid-epoch checkpoint?

rohitgr7, Sep 15 '22 13:09

No, if I am not mistaken, it was an epoch-end checkpoint, generated with ModelCheckpoint with interval set to epoch.

geoffrey-g-delhomme, Sep 15 '22 13:09

Then it should start from the new epoch and rerun the complete new epoch, with newly generated metrics that will be used to update the LR schedulers. Can you share a repro script so we can check this?

rohitgr7, Sep 15 '22 14:09