
Import error on shutdown/KeyboardInterrupt if ran from Jupyter Lab notebook cell

Open asigalov61 opened this issue 1 year ago • 6 comments

Bug description

Import error on shutdown/KeyboardInterrupt when run from a Jupyter Lab notebook cell. When run from a script, everything works fine.

What version are you seeing the problem on?

v2.4

How to reproduce the bug

Run trainer.fit from a Jupyter notebook cell, then click stop in Jupyter notebook.


print("---start train---")
trainer.fit(model, train_dataloader, ckpt_path=ckpt_path)

Error messages and logs

Detected KeyboardInterrupt, attempting graceful shutdown ...
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
~/.local/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py in _call_and_handle_interrupt(trainer, trainer_fn, *args, **kwargs)
     45         if trainer.strategy.launcher is not None:
---> 46             return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
     47         return trainer_fn(*args, **kwargs)

~/.local/lib/python3.10/site-packages/lightning/pytorch/strategies/launchers/multiprocessing.py in launch(self, function, trainer, *args, **kwargs)
    143         self.procs = process_context.processes
--> 144         while not process_context.join():
    145             pass

~/.local/lib/python3.10/site-packages/torch/multiprocessing/spawn.py in join(self, timeout)
    117         # Wait for any process to fail or all of them to succeed.
--> 118         ready = multiprocessing.connection.wait(
    119             self.sentinels.keys(),

/usr/lib/python3.10/multiprocessing/connection.py in wait(object_list, timeout)
    930             while True:
--> 931                 ready = selector.select(timeout)
    932                 if ready:

/usr/lib/python3.10/selectors.py in select(self, timeout)
    415         try:
--> 416             fd_event_list = self._selector.poll(timeout)
    417         except InterruptedError:

KeyboardInterrupt: 

During handling of the above exception, another exception occurred:

NameError                                 Traceback (most recent call last)
/tmp/ipykernel_2824/3752444865.py in <module>
    189     ckpt_path = None
    190 print("---start train---")
--> 191 trainer.fit(model, train_dataloader, ckpt_path=ckpt_path)

~/.local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py in fit(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
    536         self.state.status = TrainerStatus.RUNNING
    537         self.training = True
--> 538         call._call_and_handle_interrupt(
    539             self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
    540         )

~/.local/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py in _call_and_handle_interrupt(trainer, trainer_fn, *args, **kwargs)
     62         if isinstance(launcher, _SubprocessScriptLauncher):
     63             launcher.kill(_get_sigkill_signal())
---> 64         exit(1)
     65 
     66     except BaseException as exception:

NameError: name 'exit' is not defined

Environment

Current environment
  • CUDA:
    • GPU:
      • NVIDIA A100-SXM4-40GB
    • available: True
    • version: 12.1
  • Lightning:
    • lightning: 2.4.0
    • lightning-utilities: 0.11.7
    • pytorch-lightning: 2.4.0
    • torch: 2.4.1
    • torch-summary: 1.4.5
    • torchmetrics: 1.4.2
    • torchvision: 0.15.2
  • Packages:
    • absl-py: 0.15.0
    • aiohappyeyeballs: 2.4.3
    • aiohttp: 3.10.8
    • aiosignal: 1.3.1
    • aiosqlite: 0.19.0
    • annotated-types: 0.6.0
    • anyio: 4.1.0
    • appdirs: 1.4.4
    • argon2-cffi: 21.1.0
    • arrow: 1.3.0
    • astunparse: 1.6.3
    • async-lru: 2.0.4
    • async-timeout: 4.0.3
    • attrs: 23.1.0
    • automat: 20.2.0
    • babel: 2.13.1
    • backcall: 0.2.0
    • bcrypt: 3.2.0
    • beautifulsoup4: 4.10.0
    • beniget: 0.4.1
    • bleach: 4.1.0
    • blinker: 1.4
    • bottle: 0.12.19
    • bottleneck: 1.3.2
    • brotli: 1.0.9
    • cachetools: 5.0.0
    • certifi: 2020.6.20
    • cffi: 1.15.0
    • chardet: 4.0.0
    • charset-normalizer: 3.3.2
    • click: 8.0.3
    • cloud-init: 23.3.3
    • colorama: 0.4.4
    • comm: 0.2.0
    • command-not-found: 0.3
    • configobj: 5.0.6
    • constantly: 15.1.0
    • cryptography: 3.4.8
    • ctop: 1.0.0
    • cycler: 0.11.0
    • dacite: 1.8.1
    • dbus-python: 1.2.18
    • debugpy: 1.8.0
    • decorator: 4.4.2
    • defusedxml: 0.7.1
    • distlib: 0.3.4
    • distro: 1.7.0
    • distro-info: 1.1+ubuntu0.1
    • docker: 5.0.3
    • entrypoints: 0.4
    • et-xmlfile: 1.0.1
    • exceptiongroup: 1.2.0
    • fastjsonschema: 2.19.0
    • filelock: 3.6.0
    • flake8: 4.0.1
    • flatbuffers: 1.12.1-git20200711.33e2d80-dfsg1-0.6
    • fonttools: 4.29.1
    • fqdn: 1.5.1
    • frozenlist: 1.4.1
    • fs: 2.4.12
    • fsspec: 2024.9.0
    • future: 0.18.2
    • gast: 0.5.2
    • glances: 3.2.4.2
    • google-auth: 1.5.1
    • google-auth-oauthlib: 0.4.2
    • google-pasta: 0.2.0
    • grpcio: 1.30.2
    • h5py: 3.6.0
    • h5py.-debian-h5py-serial: 3.6.0
    • html5lib: 1.1
    • htmlmin: 0.1.12
    • httplib2: 0.20.2
    • huggingface-hub: 0.25.1
    • hyperlink: 21.0.0
    • icdiff: 2.0.4
    • idna: 3.3
    • imagehash: 4.3.1
    • importlib-metadata: 4.6.4
    • incremental: 21.3.0
    • influxdb: 5.3.1
    • iniconfig: 1.1.1
    • iotop: 0.6
    • ipykernel: 6.7.0
    • ipython: 7.31.1
    • ipython-genutils: 0.2.0
    • ipywidgets: 8.1.1
    • isoduration: 20.11.0
    • jax: 0.4.14
    • jaxlib: 0.4.14
    • jdcal: 1.0
    • jedi: 0.18.0
    • jeepney: 0.7.1
    • jinja2: 3.0.3
    • joblib: 0.17.0
    • json5: 0.9.14
    • jsonpatch: 1.32
    • jsonpointer: 2.0
    • jsonschema: 4.20.0
    • jsonschema-specifications: 2023.11.2
    • jupyter-client: 8.6.0
    • jupyter-console: 6.4.0
    • jupyter-core: 5.5.0
    • jupyter-events: 0.9.0
    • jupyter-lsp: 2.2.1
    • jupyter-server: 2.12.0
    • jupyter-server-fileid: 0.9.0
    • jupyter-server-terminals: 0.4.4
    • jupyter-ydoc: 1.1.1
    • jupyterlab: 4.0.9
    • jupyterlab-pygments: 0.1.2
    • jupyterlab-server: 2.25.2
    • jupyterlab-widgets: 3.0.9
    • kaptan: 0.5.12
    • keras: 2.13.1
    • keyring: 23.5.0
    • kiwisolver: 1.3.2
    • launchpadlib: 1.10.16
    • lazr.restfulclient: 0.14.4
    • lazr.uri: 1.0.6
    • libtmux: 0.10.1
    • lightning: 2.4.0
    • lightning-utilities: 0.11.7
    • llvmlite: 0.41.1
    • lxml: 4.8.0
    • lz4: 3.1.3+dfsg
    • markdown: 3.3.6
    • markupsafe: 2.0.1
    • matplotlib: 3.5.1
    • matplotlib-inline: 0.1.3
    • mccabe: 0.6.1
    • mistune: 3.0.2
    • ml-dtypes: 0.2.0
    • more-itertools: 8.10.0
    • mpmath: 0.0.0
    • msgpack: 1.0.3
    • multidict: 6.1.0
    • multimethod: 1.10
    • nbclient: 0.5.6
    • nbconvert: 7.12.0
    • nbformat: 5.9.2
    • nest-asyncio: 1.5.4
    • netifaces: 0.11.0
    • networkx: 2.4
    • nose: 1.3.7
    • notebook: 6.4.8
    • notebook-shim: 0.2.3
    • numba: 0.58.1
    • numexpr: 2.8.1
    • numpy: 1.25.2
    • nvidia-cublas-cu12: 12.1.3.1
    • nvidia-cuda-cupti-cu12: 12.1.105
    • nvidia-cuda-nvrtc-cu12: 12.1.105
    • nvidia-cuda-runtime-cu12: 12.1.105
    • nvidia-cudnn-cu12: 9.1.0.70
    • nvidia-cufft-cu12: 11.0.2.54
    • nvidia-curand-cu12: 10.3.2.106
    • nvidia-cusolver-cu12: 11.4.5.107
    • nvidia-cusparse-cu12: 12.1.0.106
    • nvidia-ml-py3: 7.352.0
    • nvidia-nccl-cu12: 2.20.5
    • nvidia-nvjitlink-cu12: 12.6.77
    • nvidia-nvtx-cu12: 12.1.105
    • oauthlib: 3.2.0
    • odfpy: 1.4.2
    • olefile: 0.46
    • openpyxl: 3.0.9
    • opt-einsum: 3.3.0
    • overrides: 7.4.0
    • packaging: 21.3
    • pandas: 1.3.5
    • pandas-profiling: 3.6.6
    • pandocfilters: 1.5.0
    • parso: 0.8.1
    • patsy: 0.5.4
    • pexpect: 4.8.0
    • phik: 0.12.3
    • pickleshare: 0.7.5
    • pillow: 9.0.1
    • pip: 23.3.1
    • platformdirs: 2.5.1
    • pluggy: 0.13.0
    • ply: 3.11
    • prometheus-client: 0.9.0
    • prompt-toolkit: 3.0.28
    • protobuf: 4.21.12
    • psutil: 5.9.0
    • ptyprocess: 0.7.0
    • py: 1.10.0
    • pyasn1: 0.4.8
    • pyasn1-modules: 0.2.1
    • pycodestyle: 2.8.0
    • pycparser: 2.21
    • pycryptodomex: 3.11.0
    • pydantic: 2.5.2
    • pydantic-core: 2.14.5
    • pyflakes: 2.4.0
    • pygments: 2.11.2
    • pygobject: 3.42.1
    • pyhamcrest: 2.0.2
    • pyinotify: 0.9.6
    • pyjwt: 2.3.0
    • pyopenssl: 21.0.0
    • pyparsing: 2.4.7
    • pyrsistent: 0.18.1
    • pyserial: 3.5
    • pysmi: 0.3.2
    • pysnmp: 4.4.12
    • pystache: 0.6.0
    • pytest: 6.2.5
    • python-apt: 2.4.0+ubuntu2
    • python-dateutil: 2.8.2
    • python-debian: 0.1.43+ubuntu1.1
    • python-json-logger: 2.0.7
    • python-magic: 0.4.24
    • pythran: 0.10.0
    • pytorch-lightning: 2.4.0
    • pytz: 2022.1
    • pywavelets: 1.5.0
    • pyyaml: 5.4.1
    • pyzmq: 25.1.2
    • referencing: 0.31.1
    • regex: 2024.9.11
    • requests: 2.31.0
    • requests-oauthlib: 1.3.0
    • rfc3339-validator: 0.1.4
    • rfc3986-validator: 0.1.1
    • rpds-py: 0.13.2
    • rsa: 4.8
    • safetensors: 0.4.5
    • scikit-learn: 0.23.2
    • scipy: 1.8.0
    • seaborn: 0.12.2
    • secretstorage: 3.3.1
    • send2trash: 1.8.2
    • service-identity: 18.1.0
    • setuptools: 59.6.0
    • simplejson: 3.17.6
    • six: 1.16.0
    • sniffio: 1.3.0
    • sos: 4.5.6
    • soupsieve: 2.3.1
    • ssh-import-id: 5.11
    • statsmodels: 0.14.0
    • sympy: 1.9
    • systemd-python: 234
    • tables: 3.7.0
    • tangled-up-in-unicode: 0.2.0
    • tensorboard: 2.13.0
    • tensorflow: 2.13.1
    • tensorflow-estimator: 2.13.0
    • termcolor: 1.1.0
    • terminado: 0.13.1
    • testpath: 0.5.0
    • threadpoolctl: 3.1.0
    • tinycss2: 1.2.1
    • tmuxp: 1.9.2
    • tokenizers: 0.20.0
    • toml: 0.10.2
    • tomli: 2.0.1
    • torch: 2.4.1
    • torch-summary: 1.4.5
    • torchmetrics: 1.4.2
    • torchvision: 0.15.2
    • tornado: 6.4
    • tqdm: 4.66.1
    • traitlets: 5.14.0
    • transformers: 4.45.1
    • triton: 3.0.0
    • twisted: 22.1.0
    • typeguard: 4.1.5
    • types-python-dateutil: 2.8.19.14
    • typing-extensions: 4.8.0
    • ubuntu-advantage-tools: 8001
    • ufolib2: 0.13.1
    • ufw: 0.36.1
    • unattended-upgrades: 0.1
    • unicodedata2: 14.0.0
    • uri-template: 1.3.0
    • urllib3: 1.26.5
    • virtualenv: 20.13.0+ds
    • visions: 0.7.5
    • wadllib: 1.3.6
    • wcwidth: 0.2.5
    • webcolors: 1.13
    • webencodings: 0.5.1
    • websocket-client: 1.2.3
    • werkzeug: 2.0.2
    • wheel: 0.37.1
    • widgetsnbextension: 4.0.9
    • wordcloud: 1.9.2
    • wrapt: 1.13.3
    • xlwt: 1.3.0
    • y-py: 0.6.2
    • yarl: 1.13.1
    • ydata-profiling: 4.6.3
    • ypy-websocket: 0.12.4
    • zipp: 1.0.0
    • zope.interface: 5.4.0
  • System:
    • OS: Linux
    • architecture:
      • 64bit
      • ELF
    • processor: x86_64
    • python: 3.10.12
    • release: 6.2.0-37-generic
    • version: #38~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov 2 18:01:13 UTC 2
#- PyTorch Lightning Version (e.g., 2.4.0): 2.4.0
#- PyTorch Version (e.g., 2.4): 2.4.1+cu121
#- Python version (e.g., 3.12):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration: 1xA100 40GB
#- How you installed Lightning(`conda`, `pip`, source): pip install lightning

More info

No response

asigalov61, Oct 03 '24 21:10

Avoid exit(1): in a Jupyter environment, exit() can cause problems. The bare exit works in standard Python scripts, but it should not be called in Jupyter notebooks. You can use sys.exit() instead:

import sys
sys.exit(1)

However, the recommended approach is to avoid using exit() or sys.exit() directly, especially in Jupyter notebook environments, where these commands can interrupt the kernel process and cause unnecessary problems.
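To make the difference concrete, here is a minimal sketch (not taken from Lightning; the helper name `exit_code_of` is made up for illustration). It shows that `sys.exit` only raises `SystemExit`, while the bare name `exit` is a convenience that the `site` module adds to builtins and that may be missing, for example under `python -S`. The traceback above suggests the notebook kernel is another host where it is not available to library code.

```python
import builtins
import sys


def exit_code_of(exit_fn, code):
    """Call an exit function and return the code carried by SystemExit."""
    try:
        exit_fn(code)
    except SystemExit as exc:
        return exc.code
    return None


# sys.exit is reached through the imported module and simply raises SystemExit.
print("sys.exit(1) raised SystemExit with code:", exit_code_of(sys.exit, 1))

# The bare name `exit` lives in builtins only if the host put it there.
print("builtins has 'exit':", hasattr(builtins, "exit"))
```

Running the second check inside a notebook cell and inside a plain script is a quick way to see whether the two environments differ.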

nocoding03, Oct 11 '24 05:10

@nocoding03 My code/notebook does not use or call exit. The problem is in the PyTorch Lightning module.

If you double-check the provided traceback, you will see that the error comes from the ~/.local/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py module.

asigalov61, Oct 25 '24 08:10

I also see this issue with lightning v2.4.0 and torch v2.5.1 while training in a Jupyter notebook. When I stop the training run, instead of a graceful shutdown I get this error:

NameError                                 Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/call.py in _call_and_handle_interrupt(trainer, trainer_fn, *args, **kwargs)
     62         if isinstance(launcher, _SubprocessScriptLauncher):
     63             launcher.kill(_get_sigkill_signal())
---> 64         exit(1)
     65 
     66     except BaseException as exception:

NameError: name 'exit' is not defined

This seems to be an issue with Lightning not importing exit from sys, so the bare exit(1) call is not defined.

ori-kron-wis, Dec 19 '24 14:12

Same issue in 2.5.0, but there it even fails when defining the trainer and kills the kernel.

odusseys, Dec 31 '24 10:12

What's the status of this? The bug was reported 5 months ago in that specific branch https://github.com/Lightning-AI/pytorch-lightning/pull/19976 authored by @awaelchli and approved by @lantiga. There seems to be no activity in fixing this. My understanding is that importing exit from sys should be sufficient to fix it, but I might be missing something.
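A rough sketch of what that suggestion amounts to, assuming the only change needed is to call `sys.exit` instead of the bare `exit`. The function `call_and_handle_interrupt` below is a hypothetical stand-in written for illustration, not Lightning's actual `_call_and_handle_interrupt`, and `interrupted_training` just simulates pressing stop.

```python
import sys


def call_and_handle_interrupt(fn, *args, **kwargs):
    """Hypothetical stand-in for an interrupt handler; not Lightning's code."""
    try:
        return fn(*args, **kwargs)
    except KeyboardInterrupt:
        print("Detected KeyboardInterrupt, attempting graceful shutdown ...")
        # sys.exit is looked up on the imported module, so it does not
        # depend on the bare name `exit` being present in builtins.
        sys.exit(1)


def interrupted_training():
    # Simulates the user pressing stop while training is running.
    raise KeyboardInterrupt


if __name__ == "__main__":
    try:
        call_and_handle_interrupt(interrupted_training)
    except SystemExit as exc:
        print("handler exited with code", exc.code)
```

With this shape the interrupt surfaces as `SystemExit` rather than as a `NameError` raised from inside the handler.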

canergen, Jan 13 '25 21:01

Use this workaround for now:

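import gc
import torch

try:
    trainer.fit(model, train_loader, val_loader)
except NameError:
    # Raised by the bare exit(1) in Lightning's interrupt handler (see traceback above)
    gc.collect()  # collect unreachable Python objects
    torch.cuda.empty_cache()  # release cached GPU memory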
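One caveat with catching every NameError is that it would also hide a genuine typo in the training code. A slightly narrower variant, as a sketch only: the helper name `run_ignoring_missing_exit` is made up here, and it relies on the `NameError.name` attribute (Python 3.10+) with a fallback to the message text.

```python
def run_ignoring_missing_exit(fn, *args, **kwargs):
    """Run fn, swallowing only the NameError caused by the missing `exit`."""
    try:
        return fn(*args, **kwargs)
    except NameError as err:
        # NameError.name is set by the interpreter on Python 3.10+;
        # fall back to the message text otherwise.
        missing = getattr(err, "name", None)
        if missing != "exit" and "name 'exit' is not defined" not in str(err):
            raise
        return None


def fake_interrupted_fit():
    # Stand-in for trainer.fit hitting the error shown in the traceback.
    raise NameError("name 'exit' is not defined")


print(run_ignoring_missing_exit(fake_interrupted_fit))
```

In a notebook this would be called as `run_ignoring_missing_exit(trainer.fit, model, train_loader, val_loader)`, followed by whatever cleanup is wanted.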

jamartinh, May 06 '25 20:05