pytorch-lightning
[Bug] RuntimeError: No backend type associated with device type cpu
Bug description
Upgrading both torch and lightning to 2.1.0 and running DDP leads to the following error trace:
Traceback (most recent call last):
  File "/home/nikhil_valencediscovery_com/projects/openMLIP/src/mlip/train.py", line 126, in main
    train(cfg)
  File "/home/nikhil_valencediscovery_com/projects/openMLIP/src/mlip/train.py", line 102, in train
    trainer.fit(model, datamodule=datamodule, ckpt_path=cfg.get("ckpt_path"))
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 545, in fit
    call._call_and_handle_interrupt(
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 102, in launch
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 581, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 990, in _run
    results = self._run_stage()
              ^^^^^^^^^^^^^^^^^
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 1034, in _run_stage
    self._run_sanity_check()
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 1063, in _run_sanity_check
    val_loop.run()
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/loops/utilities.py", line 181, in _decorator
    return loop_run(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 141, in run
    return self.on_run_end()
           ^^^^^^^^^^^^^^^^^
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 253, in on_run_end
    self._on_evaluation_epoch_end()
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 331, in _on_evaluation_epoch_end
    trainer._logger_connector.on_epoch_end()
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py", line 187, in on_epoch_end
    metrics = self.metrics
              ^^^^^^^^^^^^
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py", line 226, in metrics
    return self.trainer._results.metrics(on_step)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py", line 471, in metrics
    value = self._get_cache(result_metric, on_step)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py", line 435, in _get_cache
    result_metric.compute()
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py", line 280, in wrapped_func
    self._computed = compute(*args, **kwargs)
                     ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py", line 243, in compute
    value = self.meta.sync(self.value.clone())  # `clone` because `sync` is in-place
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/strategies/ddp.py", line 330, in reduce
    return _sync_ddp_if_available(tensor, group, reduce_op=reduce_op)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/fabric/utilities/distributed.py", line 171, in _sync_ddp_if_available
    return _sync_ddp(result, group=group, reduce_op=reduce_op)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/fabric/utilities/distributed.py", line 221, in _sync_ddp
    torch.distributed.all_reduce(result, op=op, group=group, async_op=False)
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2050, in all_reduce
    work = group.allreduce([tensor], opts)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: No backend type associated with device type cpu
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
On downgrading lightning to 2.0.1, the error goes away.
What version are you seeing the problem on?
master
How to reproduce the bug
No response
Error messages and logs
Environment
<details>
<summary>Current environment</summary>
* CUDA:
- GPU: None
- available: False
- version: 11.8
* Lightning:
- lightning: 2.0.1.post0
- lightning-cloud: 0.5.42
- lightning-utilities: 0.9.0
- pytorch-lightning: 2.1.0
- torch: 2.1.0
- torch-cluster: 1.6.3
- torch-geometric: 2.4.0
- torch-scatter: 2.1.2
- torch-sparse: 0.6.18
- torchmetrics: 1.2.0
* Packages:
- absl-py: 2.0.0
- aiobotocore: 2.5.4
- aiohttp: 3.8.6
- aioitertools: 0.11.0
- aiosignal: 1.3.1
- antlr4-python3-runtime: 4.9.3
- anyio: 3.7.1
- appdirs: 1.4.4
- argon2-cffi: 23.1.0
- argon2-cffi-bindings: 21.2.0
- arrow: 1.3.0
- ase: 3.22.1
- asttokens: 2.4.0
- async-lru: 2.0.4
- async-timeout: 4.0.3
- attrs: 23.1.0
- babel: 2.13.0
- backcall: 0.2.0
- backoff: 2.2.1
- backports.cached-property: 1.0.2
- backports.functools-lru-cache: 1.6.5
- beautifulsoup4: 4.12.2
- black: 23.9.1
- bleach: 6.1.0
- blessed: 1.19.1
- blinker: 1.6.3
- boto3: 1.28.17
- botocore: 1.31.17
- brotli: 1.1.0
- build: 0.10.0
- cachecontrol: 0.12.14
- cached-property: 1.5.2
- cachetools: 5.3.1
- certifi: 2023.7.22
- cffi: 1.16.0
- cfgv: 3.3.1
- charset-normalizer: 3.3.0
- cleo: 2.0.1
- click: 8.1.7
- colorama: 0.4.6
- comm: 0.1.4
- contourpy: 1.1.1
- coverage: 7.3.2
- crashtest: 0.4.1
- croniter: 1.3.15
- cryptography: 41.0.4
- cycler: 0.12.1
- datamol: 0.0.0
- dateutils: 0.6.12
- debugpy: 1.8.0
- decorator: 5.1.1
- deepdiff: 6.6.0
- defusedxml: 0.7.1
- distlib: 0.3.7
- docker-pycreds: 0.4.0
- dulwich: 0.21.6
- e3nn: 0.5.1
- einops: 0.6.0
- entrypoints: 0.4
- exceptiongroup: 1.1.3
- executing: 1.2.0
- fastapi: 0.88.0
- fastjsonschema: 2.18.1
- filelock: 3.12.4
- flask: 3.0.0
- fonttools: 4.43.1
- fqdn: 1.5.1
- freetype-py: 2.3.0
- frozenlist: 1.4.0
- fsspec: 2023.9.2
- gcsfs: 2023.9.2
- gitdb: 4.0.10
- gitpython: 3.1.37
- gmpy2: 2.1.2
- google-api-core: 2.12.0
- google-auth: 2.23.3
- google-auth-oauthlib: 0.4.6
- google-cloud-core: 2.3.3
- google-cloud-storage: 2.12.0
- google-crc32c: 1.1.2
- google-resumable-media: 2.6.0
- googleapis-common-protos: 1.61.0
- greenlet: 3.0.0
- grpcio: 1.59.1
- h11: 0.14.0
- h5py: 3.10.0
- html5lib: 1.1
- hydra-core: 1.3.2
- identify: 2.5.30
- idna: 3.4
- importlib-metadata: 6.8.0
- importlib-resources: 6.1.0
- iniconfig: 2.0.0
- inquirer: 3.1.3
- installer: 0.7.0
- ipdb: 0.13.13
- ipykernel: 6.25.2
- ipython: 8.16.1
- ipywidgets: 8.1.1
- isoduration: 20.11.0
- itsdangerous: 2.1.2
- jaraco.classes: 3.3.0
- jedi: 0.19.1
- jeepney: 0.8.0
- jinja2: 3.1.2
- jmespath: 1.0.1
- joblib: 1.3.2
- json5: 0.9.14
- jsonpointer: 2.4
- jsonschema: 4.19.1
- jsonschema-specifications: 2023.7.1
- jupyter-client: 8.4.0
- jupyter-core: 5.4.0
- jupyter-events: 0.7.0
- jupyter-lsp: 2.2.0
- jupyter-server: 2.7.3
- jupyter-server-terminals: 0.4.4
- jupyterlab: 4.0.7
- jupyterlab-pygments: 0.2.2
- jupyterlab-server: 2.25.0
- jupyterlab-widgets: 3.0.9
- keyring: 23.13.1
- kiwisolver: 1.4.5
- lightning: 2.0.1.post0
- lightning-cloud: 0.5.42
- lightning-utilities: 0.9.0
- lockfile: 0.12.2
- loguru: 0.7.2
- markdown: 3.5
- markdown-it-py: 3.0.0
- markupsafe: 2.1.3
- matplotlib: 3.8.0
- matplotlib-inline: 0.1.6
- matscipy: 0.7.0
- mdurl: 0.1.0
- mistune: 3.0.1
- mlip: 0.0.1.dev157+gc3d9c0b.d20231016
- more-itertools: 10.1.0
- mpmath: 1.3.0
- msgpack: 1.0.6
- multidict: 6.0.4
- munkres: 1.1.4
- mypy-extensions: 1.0.0
- nbclient: 0.8.0
- nbconvert: 7.9.2
- nbformat: 5.9.2
- nest-asyncio: 1.5.8
- networkx: 3.1
- nodeenv: 1.8.0
- notebook-shim: 0.2.3
- numpy: 1.26.0
- oauthlib: 3.2.2
- omegaconf: 2.3.0
- openqdc: 0.0.0
- opt-einsum: 3.3.0
- opt-einsum-fx: 0.1.4
- ordered-set: 4.1.0
- orjson: 3.9.8
- overrides: 7.4.0
- packaging: 23.2
- pandas: 2.1.1
- pandocfilters: 1.5.0
- parso: 0.8.3
- pathspec: 0.11.2
- pathtools: 0.1.2
- patsy: 0.5.3
- pexpect: 4.8.0
- pickleshare: 0.7.5
- pillow: 10.1.0
- pip: 23.3
- pkginfo: 1.9.6
- pkgutil-resolve-name: 1.3.10
- platformdirs: 3.11.0
- pluggy: 1.3.0
- ply: 3.11
- poetry: 1.5.1
- poetry-core: 1.6.1
- poetry-plugin-export: 1.5.0
- pre-commit: 3.5.0
- prettytable: 3.9.0
- prometheus-client: 0.17.1
- prompt-toolkit: 3.0.39
- protobuf: 4.24.4
- psutil: 5.9.5
- ptyprocess: 0.7.0
- pure-eval: 0.2.2
- pyasn1: 0.5.0
- pyasn1-modules: 0.3.0
- pycairo: 1.25.0
- pycparser: 2.21
- pydantic: 1.10.13
- pygments: 2.16.1
- pyjwt: 2.8.0
- pyopenssl: 23.2.0
- pyparsing: 3.1.1
- pyproject-hooks: 1.0.0
- pyqt5: 5.15.9
- pyqt5-sip: 12.12.2
- pyrootutils: 1.0.4
- pysocks: 1.7.1
- pytest: 7.4.2
- pytest-cov: 4.1.0
- python-dateutil: 2.8.2
- python-dotenv: 1.0.0
- python-editor: 1.0.4
- python-json-logger: 2.0.7
- python-multipart: 0.0.6
- pytorch-lightning: 2.1.0
- pytz: 2023.3.post1
- pyu2f: 0.1.5
- pyyaml: 6.0.1
- pyzmq: 25.1.1
- rapidfuzz: 2.15.2
- readchar: 4.0.5.dev0
- referencing: 0.30.2
- reportlab: 4.0.6
- requests: 2.31.0
- requests-oauthlib: 1.3.1
- requests-toolbelt: 1.0.0
- rfc3339-validator: 0.1.4
- rfc3986-validator: 0.1.1
- rich: 13.6.0
- rlpycairo: 0.2.0
- rpds-py: 0.10.6
- rsa: 4.9
- ruff: 0.0.292
- s3fs: 2023.9.2
- s3transfer: 0.6.2
- scikit-learn: 1.3.1
- scipy: 1.11.3
- seaborn: 0.13.0
- secretstorage: 3.3.3
- selfies: 2.1.1
- send2trash: 1.8.2
- sentry-sdk: 1.32.0
- setproctitle: 1.3.3
- setuptools: 68.2.2
- shellingham: 1.5.3
- sip: 6.7.12
- six: 1.16.0
- smmap: 3.0.5
- sniffio: 1.3.0
- soupsieve: 2.5
- sqlalchemy: 2.0.22
- stack-data: 0.6.2
- starlette: 0.22.0
- starsessions: 1.3.0
- statsmodels: 0.14.0
- sympy: 1.12
- tensorboard: 2.11.2
- tensorboard-data-server: 0.6.1
- tensorboard-plugin-wit: 1.8.1
- terminado: 0.17.1
- threadpoolctl: 3.2.0
- tinycss2: 1.2.1
- toml: 0.10.2
- tomli: 2.0.1
- tomlkit: 0.12.1
- torch: 2.1.0
- torch-cluster: 1.6.3
- torch-geometric: 2.4.0
- torch-scatter: 2.1.2
- torch-sparse: 0.6.18
- torchmetrics: 1.2.0
- tornado: 6.3.3
- tqdm: 4.66.1
- traitlets: 5.11.2
- triton: 2.1.0
- trove-classifiers: 2023.9.19
- types-python-dateutil: 2.8.19.14
- typing-extensions: 4.8.0
- typing-utils: 0.1.0
- tzdata: 2023.3
- ukkonen: 1.0.1
- uri-template: 1.3.0
- urllib3: 1.26.17
- uvicorn: 0.23.2
- virtualenv: 20.24.4
- wandb: 0.15.12
- wcwidth: 0.2.8
- webcolors: 1.13
- webencodings: 0.5.1
- websocket-client: 1.6.4
- websockets: 11.0.3
- werkzeug: 3.0.0
- wheel: 0.41.2
- widgetsnbextension: 4.0.9
- wrapt: 1.15.0
- yarl: 1.9.2
- zipp: 3.17.0
* System:
- OS: Linux
- architecture:
- 64bit
- ELF
- processor: x86_64
- python: 3.11.6
- release: 5.15.0-1032-gcp
- version: #40~20.04.1-Ubuntu SMP Tue Apr 11 02:49:52 UTC 2023
</details>
More info
No response
I can confirm that I am also experiencing this bug. Downgrading to 2.0.8 fixes it.
Current env:
lightning 2.1.0
lightning-cloud 0.5.42
lightning-utilities 0.9.0
pytorch-lightning 2.1.0
torch 2.1.0+cu118
torchmetrics 1.2.0
torchvision 0.16.0+cu118
Stack:
File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 545, in fit
call._call_and_handle_interrupt(
File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 102, in launch
return function(*args, **kwargs)
File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 581, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 990, in _run
results = self._run_stage()
File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 1034, in _run_stage
self._run_sanity_check()
File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 1063, in _run_sanity_check
val_loop.run()
File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/loops/utilities.py", line 181, in _decorator
return loop_run(self, *args, **kwargs)
File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 141, in run
return self.on_run_end()
File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 253, in on_run_end
self._on_evaluation_epoch_end()
File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 331, in _on_evaluation_epoch_end
trainer._logger_connector.on_epoch_end()
File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py", line 187, in on_epoch_end
Traceback (most recent call last):
metrics = self.metrics
File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py", line 226, in metrics
return self.trainer._results.metrics(on_step)
File "/mnt/hdd1/users/Documents/dev/sandbox_dsm_scripts/train/train.py", line 91, in <module>
File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py", line 471, in metrics
value = self._get_cache(result_metric, on_step)
pytroch()
File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py", line 435, in _get_cache
File "/mnt/hdd1/users/Documents/dev/sandbox_dsm_scripts/train/train.py", line 73, in pytroch
result_metric.compute()
File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py", line 280, in wrapped_func
trainer.train_cls_locally(
File "/mnt/hdd1/users/Documents/dev/trainer.py", line 927, in train_cls_locally
self._computed = compute(*args, **kwargs)
File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py", line 245, in compute
cumulated_batch_size = self.meta.sync(self.cumulated_batch_size)
trainer.fit(model, datamodule=datamodule)
File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/strategies/ddp.py", line 330, in reduce
File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 545, in fit
return _sync_ddp_if_available(tensor, group, reduce_op=reduce_op)
File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/fabric/utilities/distributed.py", line 171, in _sync_ddp_if_available
call._call_and_handle_interrupt(
return _sync_ddp(result, group=group, reduce_op=reduce_op)
File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/fabric/utilities/distributed.py", line 221, in _sync_ddp
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
torch.distributed.all_reduce(result, op=op, group=group, async_op=False)
File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1536, in all_reduce
File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 102, in launch
return function(*args, **kwargs)
work = group.allreduce([tensor], opts)
File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 581, in _fit_impl
RuntimeError: Tensors must be CUDA and dense
self._run(model, ckpt_path=ckpt_path)
File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 990, in _run
results = self._run_stage()
File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 1034, in _run_stage
self._run_sanity_check()
File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 1063, in _run_sanity_check
val_loop.run()
File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/loops/utilities.py", line 181, in _decorator
return loop_run(self, *args, **kwargs)
File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 141, in run
return self.on_run_end()
File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 253, in on_run_end
self._on_evaluation_epoch_end()
File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 331, in _on_evaluation_epoch_end
trainer._logger_connector.on_epoch_end()
File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py", line 187, in on_epoch_end
metrics = self.metrics
File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py", line 226, in metrics
return self.trainer._results.metrics(on_step)
File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py", line 471, in metrics
value = self._get_cache(result_metric, on_step)
File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py", line 435, in _get_cache
result_metric.compute()
File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py", line 280, in wrapped_func
self._computed = compute(*args, **kwargs)
File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py", line 245, in compute
cumulated_batch_size = self.meta.sync(self.cumulated_batch_size)
File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/strategies/ddp.py", line 330, in reduce
return _sync_ddp_if_available(tensor, group, reduce_op=reduce_op)
File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/fabric/utilities/distributed.py", line 171, in _sync_ddp_if_available
return _sync_ddp(result, group=group, reduce_op=reduce_op)
File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/fabric/utilities/distributed.py", line 221, in _sync_ddp
torch.distributed.all_reduce(result, op=op, group=group, async_op=False)
File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1536, in all_reduce
work = group.allreduce([tensor], opts)
RuntimeError: Tensors must be CUDA and dense
I see this as well
I started seeing this error but couldn't figure out what caused it. It appears after the first validation epoch, apparently when computing a metric in the on_epoch_end callback. Downgrading to 2.0.8 helped.
Same bug here after upgrading to torch==2.1.0 and lightning==2.1.0. The bug appeared when running Metric.compute() of a torchmetrics metric after a validation epoch.
Edit: I am using Lightning Fabric instead of the Lightning Trainer; the bug is also triggered there.
Same for me. Downgrading to pytorch-lightning==2.0.8 fixed the issue.
I've got the same error on torch==2.1.0 and lightning==2.1.0, and it was fixed by downgrading to pytorch_lightning==2.0.8.
I also just ran across this error. It seems like the self.log(key, val) calls have changed in some way: in my case the error went away when I manually moved val to the GPU in every self.log call in my code.
My feeling is that the DDP strategy in lightning==2.0.8 initialized distributed backends for both CPU and GPU when running with device=GPU. Below is a minimal example that works with 2.0.8, but crashes in 2.1.0:
import torch
from lightning import Trainer, LightningModule
from torch.utils.data import DataLoader


class LitModel(LightningModule):
    def __init__(self) -> None:
        super().__init__()
        self.layer = torch.nn.Linear(1, 1)

    def training_step(self, x):
        # Everything but the next line is just dummy-code to make it run
        self.log(
            "foo", value=torch.zeros(1, device="cpu"), on_step=True, sync_dist=True
        )
        loss = self.layer(x).mean()
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

    def train_dataloader(self):
        return DataLoader(torch.randn(32, 1), batch_size=1)


def main():
    model = LitModel()
    trainer = Trainer(devices=2, accelerator="gpu", max_epochs=2)
    trainer.fit(model)


if __name__ == "__main__":
    main()
Note that this isn't restricted to distributed code that's run by Lightning. We have some functionality that uses torch.distributed directly, and we are running into the exact same issue when we try to broadcast non-CUDA tensors.
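For illustration, a hypothetical minimal sketch of that failure mode outside of Lightning (not the commenter's actual code; assumes a CUDA machine and a launch such as torchrun --nproc_per_node=2 repro.py):

import os
import torch
import torch.distributed as dist

# Only the NCCL (CUDA) backend is initialized, as Lightning does for GPU DDP
dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

t = torch.zeros(1)  # CPU tensor
# dist.broadcast(t, src=0)  # raises: No backend type associated with device type cpu
t_cuda = t.cuda()
dist.broadcast(t_cuda, src=0)  # works once the tensor lives on the CUDA device

dist.destroy_process_group()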
Has this issue been addressed in nightly? I was really trying to stick to either pip or conda versions and it looks like 2.0.8 is not available on either.
Same issue with PyTorch 2.1.1 and Lightning 2.1.2
It looks like the change comes from this PR: #17334 (found by git-bisecting with the code sample by @dsuess).
It looks like the change was intentional. The changelog says:
self.log-ed tensors are now kept in the original device to reduce unnecessary host-to-device synchronizations (#17334)
This means that if you pass in a tensor, it already needs to be on the right device; the user needs to perform the .to() call explicitly.
cc @carmocca
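A minimal sketch of what this implies in user code (hypothetical module and illustrative value, not taken from the issue):

import torch
from lightning import LightningModule


class MyModule(LightningModule):  # hypothetical
    def validation_step(self, batch, batch_idx):
        val = torch.zeros(1)  # a CPU tensor, e.g. computed outside the model
        # With 2.1.0 and sync_dist=True, logging the CPU tensor as-is triggers the error,
        # because the logged tensor is now kept on its original device.
        # self.log("foo", val, sync_dist=True)
        self.log("foo", val.to(self.device), sync_dist=True)  # explicit .to() call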
The resolution is not clear to me. I'm getting the message "RuntimeError: No backend type associated with device type cpu". If I were logging 20 things, some on CPU and some on GPU, what should I be doing? From your comment @awaelchli I would've thought adding .to('cpu') calls, but the error message makes me think the opposite (and moving CPU results back to GPU also seems silly).
If I understood correctly, when using self.log(..., sync_dist=True) with DDP, you have to transfer the tensor to the GPU before logging.
Is it possible to move the tensors to the correct device automatically in LightningModule.log()? If not, I feel like this should be mentioned in the documentation, and it would be good to give a better error message. Currently the 15-minute Lightning tutorial instructs to remove any .cuda() or device calls, because LightningModules are hardware-agnostic.
@awaelchli Thanks for clarifying. I've found another corner case where the new behaviour breaks existing code: if you re-use a trainer instance multiple times (e.g. for evaluating multiple epochs), you can end up with metrics moved to CPU even if you log them with GPU tensors.
The reason is that the logger connector moves all intermediate results to CPU on teardown, so on the second call to trainer.validate, the helper state (e.g. cumulated_batch_size) of the cached results is on CPU. This can be fixed by removing all cached results through trainer.validate_loop._results.clear()
Here's a full example to reproduce this:
import torch
from lightning import Trainer, LightningModule
from torch.utils.data import DataLoader


class LitModel(LightningModule):
    def __init__(self) -> None:
        super().__init__()
        self.layer = torch.nn.Linear(1, 1)

    def training_step(self, x):
        loss = self.layer(x).mean()
        return loss

    def validation_step(self, *args, **kwargs):
        self.log(
            "foo", value=torch.zeros(1, device=self.device), on_step=True, sync_dist=True
        )
        return super().validation_step(*args, **kwargs)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

    def val_dataloader(self):
        return DataLoader(torch.randn(32, 1), batch_size=1)


def main():
    model = LitModel()
    trainer = Trainer(devices=2, accelerator="gpu", max_epochs=2)
    trainer.validate(model)
    # Uncomment the following line to fix the issue
    # trainer.validate_loop._results.clear()
    trainer.validate(model)


if __name__ == "__main__":
    main()
The reason is that the logger connector moves all intermediate results to CPU on teardown, so on the second call to trainer.validate, the helper state (e.g. cumulated_batch_size) of the cached results is on CPU. This can be fixed by removing all cached results through trainer.validate_loop._results.clear()
If you want to call trainer.fit twice, the analogous fix is:
trainer.fit_loop.epoch_loop.val_loop._results.clear()
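For example, a sketch of re-using one Trainer for two fit() calls (reusing the LitModel and imports from the example above; the private attribute is the one referenced in the snippet):

model = LitModel()
trainer = Trainer(devices=2, accelerator="gpu", max_epochs=1)
trainer.fit(model)
# Clear cached results so their helper state isn't left on the CPU
trainer.fit_loop.epoch_loop.val_loop._results.clear()
trainer.fit(model)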
I'm having the same issue using the latest version and resolved it by downgrading to lightning==2.0.9.
I've solved the issue on lightning==2.1.3. When overriding any epoch_end hook, if you log, just make sure that the tensor is on the GPU device. If you initialize a new tensor, initialize it with device=self.device.
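A sketch of that suggestion (hypothetical module; the metric value is illustrative):

import torch
from lightning import LightningModule


class MyModule(LightningModule):  # hypothetical
    def on_validation_epoch_end(self):
        # Initialize new tensors directly on the module's device before logging
        some_metric = torch.tensor(0.0, device=self.device)
        self.log("some_metric", some_metric, sync_dist=True)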
@ouioui199's suggestion works. I changed my code from
self.log_dict({f"test_map_{label}": value for label, value in zip(self.id2label.values(), mAP_per_class)}, sync_dist=True)
to
self.log_dict({f"test_map_{label}": value.to("cuda") for label, value in zip(self.id2label.values(), mAP_per_class)}, sync_dist=True)
This is really helpful. This does have something to do with torchmetrics and the DDP processes. I use a Callback for logging purposes and just move the logged value to the device:
class AwesomeCallback(Callback):
    def on_validation_epoch_end(self, trainer: pl.Trainer, pl_module: pl.LightningModule):
        # `some_value` is assumed to be a tensor computed earlier in the callback
        pl_module.log("some metrics", some_value.to(pl_module.device), sync_dist=True)
This error can also be reproduced with torchmetrics 1.3.2 when storing lists of tensors on CPU via compute_on_cpu=True.
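An untested sketch of that setup, following the description above (assumes two CUDA devices and a torchrun launch; BinaryAUROC is just an example of a metric that keeps list states):

import os
import torch
import torch.distributed as dist
from torchmetrics.classification import BinaryAUROC

dist.init_process_group(backend="nccl")  # CUDA-only backend, as under GPU DDP
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# With compute_on_cpu=True the list states are moved to CPU after each update,
# so the cross-process sync inside compute() runs on CPU tensors.
metric = BinaryAUROC(compute_on_cpu=True).to("cuda")
metric.update(torch.rand(8, device="cuda"), torch.randint(0, 2, (8,), device="cuda"))
metric.compute()  # expected to raise: No backend type associated with device type cpu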
Got the same error when trying to compute a confusion matrix in a callback, when I call metric.plot():
def on_test_epoch_end(self) -> None:
    # MulticlassConfusionMatrix comes from torchmetrics.classification
    metric = MulticlassConfusionMatrix(num_classes=self.num_classes).to("cpu")
    outputs = torch.cat(self.x_test, dim=0).to("cpu")
    labels = torch.cat(self.y_test, dim=0).to("cpu")
    outputs = torch.softmax(outputs, dim=1).argmax(dim=1)
    metric.update(outputs, labels)
    pl = ["Latin", "Russian", "Arabic", "Chinese"]
    fig_, ax_ = metric.plot(labels=pl)
    fig_.savefig("test.png")
Same bug here after upgrading to torch==2.1.0 and lightning==2.1.0. The bug appeared when running Metric.compute() of a torchmetrics metric after a validation epoch. Edit: I am using Lightning Fabric instead of the Lightning Trainer; the bug is also triggered there.
For me, I also saw this on Metric.compute(). It happened when I was running integration tests where one test used a DDPStrategy and the other used a single-process strategy on the CPU. After the distributed process group has been created, an error seems to be raised if a metric is computed on the CPU.
import lightning
import torch
import torchmetrics
from torch import nn

# Launching Fabric on two CUDA devices initializes only the NCCL (CUDA) backend
fabric = lightning.Fabric(accelerator="cuda", devices=2)
fabric.launch()

module = nn.Linear(2, 1)
module = fabric.setup(module)

# The metric and its states live on the CPU, so the cross-process sync in compute() fails
metric = torchmetrics.Accuracy(task="multiclass", num_classes=2)
metric.update(torch.tensor([0., 1.]), torch.tensor([0, 1]))
metric.compute()
RuntimeError: No backend type associated with device type cpu
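Following the pattern in the earlier comments, a possible workaround for this Fabric case is to keep the metric and its inputs on the Fabric device, so the sync inside compute() runs over the NCCL backend (untested sketch, continuing the example above):

metric = torchmetrics.Accuracy(task="multiclass", num_classes=2).to(fabric.device)
metric.update(
    torch.tensor([0.0, 1.0], device=fabric.device),
    torch.tensor([0, 1], device=fabric.device),
)
metric.compute()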