[Bug] RuntimeError: No backend type associated with device type cpu

Open shenoynikhil opened this issue 1 year ago • 26 comments

Bug description

After upgrading both torch and lightning to 2.1.0, running DDP leads to the following error trace:

Traceback (most recent call last):
  File "/home/nikhil_valencediscovery_com/projects/openMLIP/src/mlip/train.py", line 126, in main
    train(cfg)
  File "/home/nikhil_valencediscovery_com/projects/openMLIP/src/mlip/train.py", line 102, in train
    trainer.fit(model, datamodule=datamodule, ckpt_path=cfg.get("ckpt_path"))
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 545, in fit
    call._call_and_handle_interrupt(
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 102, in launch
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 581, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 990, in _run
    results = self._run_stage()
              ^^^^^^^^^^^^^^^^^
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 1034, in _run_stage
    self._run_sanity_check()
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 1063, in _run_sanity_check
    val_loop.run()
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/loops/utilities.py", line 181, in _decorator
    return loop_run(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 141, in run
    return self.on_run_end()
           ^^^^^^^^^^^^^^^^^
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 253, in on_run_end
    self._on_evaluation_epoch_end()
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 331, in _on_evaluation_epoch_end
    trainer._logger_connector.on_epoch_end()
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py", line 187, in on_epoch_end
    metrics = self.metrics
              ^^^^^^^^^^^^
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py", line 226, in metrics
    return self.trainer._results.metrics(on_step)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py", line 471, in metrics
    value = self._get_cache(result_metric, on_step)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py", line 435, in _get_cache
    result_metric.compute()
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py", line 280, in wrapped_func
    self._computed = compute(*args, **kwargs)
                     ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py", line 243, in compute
    value = self.meta.sync(self.value.clone())  # `clone` because `sync` is in-place
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/pytorch/strategies/ddp.py", line 330, in reduce
    return _sync_ddp_if_available(tensor, group, reduce_op=reduce_op)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/fabric/utilities/distributed.py", line 171, in _sync_ddp_if_available
    return _sync_ddp(result, group=group, reduce_op=reduce_op)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/lightning/fabric/utilities/distributed.py", line 221, in _sync_ddp
    torch.distributed.all_reduce(result, op=op, group=group, async_op=False)
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/nikhil_valencediscovery_com/local/conda/envs/mlip4/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2050, in all_reduce
    work = group.allreduce([tensor], opts)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: No backend type associated with device type cpu
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

On downgrading lightning to 2.0.1, the error goes away.

What version are you seeing the problem on?

master

How to reproduce the bug

No response

Error messages and logs

Environment

Current environment

* CUDA:
	- GPU:               None
	- available:         False
	- version:           11.8
* Lightning:
	- lightning:         2.0.1.post0
	- lightning-cloud:   0.5.42
	- lightning-utilities: 0.9.0
	- pytorch-lightning: 2.1.0
	- torch:             2.1.0
	- torch-cluster:     1.6.3
	- torch-geometric:   2.4.0
	- torch-scatter:     2.1.2
	- torch-sparse:      0.6.18
	- torchmetrics:      1.2.0
* Packages:
	- absl-py:           2.0.0
	- aiobotocore:       2.5.4
	- aiohttp:           3.8.6
	- aioitertools:      0.11.0
	- aiosignal:         1.3.1
	- antlr4-python3-runtime: 4.9.3
	- anyio:             3.7.1
	- appdirs:           1.4.4
	- argon2-cffi:       23.1.0
	- argon2-cffi-bindings: 21.2.0
	- arrow:             1.3.0
	- ase:               3.22.1
	- asttokens:         2.4.0
	- async-lru:         2.0.4
	- async-timeout:     4.0.3
	- attrs:             23.1.0
	- babel:             2.13.0
	- backcall:          0.2.0
	- backoff:           2.2.1
	- backports.cached-property: 1.0.2
	- backports.functools-lru-cache: 1.6.5
	- beautifulsoup4:    4.12.2
	- black:             23.9.1
	- bleach:            6.1.0
	- blessed:           1.19.1
	- blinker:           1.6.3
	- boto3:             1.28.17
	- botocore:          1.31.17
	- brotli:            1.1.0
	- build:             0.10.0
	- cachecontrol:      0.12.14
	- cached-property:   1.5.2
	- cachetools:        5.3.1
	- certifi:           2023.7.22
	- cffi:              1.16.0
	- cfgv:              3.3.1
	- charset-normalizer: 3.3.0
	- cleo:              2.0.1
	- click:             8.1.7
	- colorama:          0.4.6
	- comm:              0.1.4
	- contourpy:         1.1.1
	- coverage:          7.3.2
	- crashtest:         0.4.1
	- croniter:          1.3.15
	- cryptography:      41.0.4
	- cycler:            0.12.1
	- datamol:           0.0.0
	- dateutils:         0.6.12
	- debugpy:           1.8.0
	- decorator:         5.1.1
	- deepdiff:          6.6.0
	- defusedxml:        0.7.1
	- distlib:           0.3.7
	- docker-pycreds:    0.4.0
	- dulwich:           0.21.6
	- e3nn:              0.5.1
	- einops:            0.6.0
	- entrypoints:       0.4
	- exceptiongroup:    1.1.3
	- executing:         1.2.0
	- fastapi:           0.88.0
	- fastjsonschema:    2.18.1
	- filelock:          3.12.4
	- flask:             3.0.0
	- fonttools:         4.43.1
	- fqdn:              1.5.1
	- freetype-py:       2.3.0
	- frozenlist:        1.4.0
	- fsspec:            2023.9.2
	- gcsfs:             2023.9.2
	- gitdb:             4.0.10
	- gitpython:         3.1.37
	- gmpy2:             2.1.2
	- google-api-core:   2.12.0
	- google-auth:       2.23.3
	- google-auth-oauthlib: 0.4.6
	- google-cloud-core: 2.3.3
	- google-cloud-storage: 2.12.0
	- google-crc32c:     1.1.2
	- google-resumable-media: 2.6.0
	- googleapis-common-protos: 1.61.0
	- greenlet:          3.0.0
	- grpcio:            1.59.1
	- h11:               0.14.0
	- h5py:              3.10.0
	- html5lib:          1.1
	- hydra-core:        1.3.2
	- identify:          2.5.30
	- idna:              3.4
	- importlib-metadata: 6.8.0
	- importlib-resources: 6.1.0
	- iniconfig:         2.0.0
	- inquirer:          3.1.3
	- installer:         0.7.0
	- ipdb:              0.13.13
	- ipykernel:         6.25.2
	- ipython:           8.16.1
	- ipywidgets:        8.1.1
	- isoduration:       20.11.0
	- itsdangerous:      2.1.2
	- jaraco.classes:    3.3.0
	- jedi:              0.19.1
	- jeepney:           0.8.0
	- jinja2:            3.1.2
	- jmespath:          1.0.1
	- joblib:            1.3.2
	- json5:             0.9.14
	- jsonpointer:       2.4
	- jsonschema:        4.19.1
	- jsonschema-specifications: 2023.7.1
	- jupyter-client:    8.4.0
	- jupyter-core:      5.4.0
	- jupyter-events:    0.7.0
	- jupyter-lsp:       2.2.0
	- jupyter-server:    2.7.3
	- jupyter-server-terminals: 0.4.4
	- jupyterlab:        4.0.7
	- jupyterlab-pygments: 0.2.2
	- jupyterlab-server: 2.25.0
	- jupyterlab-widgets: 3.0.9
	- keyring:           23.13.1
	- kiwisolver:        1.4.5
	- lightning:         2.0.1.post0
	- lightning-cloud:   0.5.42
	- lightning-utilities: 0.9.0
	- lockfile:          0.12.2
	- loguru:            0.7.2
	- markdown:          3.5
	- markdown-it-py:    3.0.0
	- markupsafe:        2.1.3
	- matplotlib:        3.8.0
	- matplotlib-inline: 0.1.6
	- matscipy:          0.7.0
	- mdurl:             0.1.0
	- mistune:           3.0.1
	- mlip:              0.0.1.dev157+gc3d9c0b.d20231016
	- more-itertools:    10.1.0
	- mpmath:            1.3.0
	- msgpack:           1.0.6
	- multidict:         6.0.4
	- munkres:           1.1.4
	- mypy-extensions:   1.0.0
	- nbclient:          0.8.0
	- nbconvert:         7.9.2
	- nbformat:          5.9.2
	- nest-asyncio:      1.5.8
	- networkx:          3.1
	- nodeenv:           1.8.0
	- notebook-shim:     0.2.3
	- numpy:             1.26.0
	- oauthlib:          3.2.2
	- omegaconf:         2.3.0
	- openqdc:           0.0.0
	- opt-einsum:        3.3.0
	- opt-einsum-fx:     0.1.4
	- ordered-set:       4.1.0
	- orjson:            3.9.8
	- overrides:         7.4.0
	- packaging:         23.2
	- pandas:            2.1.1
	- pandocfilters:     1.5.0
	- parso:             0.8.3
	- pathspec:          0.11.2
	- pathtools:         0.1.2
	- patsy:             0.5.3
	- pexpect:           4.8.0
	- pickleshare:       0.7.5
	- pillow:            10.1.0
	- pip:               23.3
	- pkginfo:           1.9.6
	- pkgutil-resolve-name: 1.3.10
	- platformdirs:      3.11.0
	- pluggy:            1.3.0
	- ply:               3.11
	- poetry:            1.5.1
	- poetry-core:       1.6.1
	- poetry-plugin-export: 1.5.0
	- pre-commit:        3.5.0
	- prettytable:       3.9.0
	- prometheus-client: 0.17.1
	- prompt-toolkit:    3.0.39
	- protobuf:          4.24.4
	- psutil:            5.9.5
	- ptyprocess:        0.7.0
	- pure-eval:         0.2.2
	- pyasn1:            0.5.0
	- pyasn1-modules:    0.3.0
	- pycairo:           1.25.0
	- pycparser:         2.21
	- pydantic:          1.10.13
	- pygments:          2.16.1
	- pyjwt:             2.8.0
	- pyopenssl:         23.2.0
	- pyparsing:         3.1.1
	- pyproject-hooks:   1.0.0
	- pyqt5:             5.15.9
	- pyqt5-sip:         12.12.2
	- pyrootutils:       1.0.4
	- pysocks:           1.7.1
	- pytest:            7.4.2
	- pytest-cov:        4.1.0
	- python-dateutil:   2.8.2
	- python-dotenv:     1.0.0
	- python-editor:     1.0.4
	- python-json-logger: 2.0.7
	- python-multipart:  0.0.6
	- pytorch-lightning: 2.1.0
	- pytz:              2023.3.post1
	- pyu2f:             0.1.5
	- pyyaml:            6.0.1
	- pyzmq:             25.1.1
	- rapidfuzz:         2.15.2
	- readchar:          4.0.5.dev0
	- referencing:       0.30.2
	- reportlab:         4.0.6
	- requests:          2.31.0
	- requests-oauthlib: 1.3.1
	- requests-toolbelt: 1.0.0
	- rfc3339-validator: 0.1.4
	- rfc3986-validator: 0.1.1
	- rich:              13.6.0
	- rlpycairo:         0.2.0
	- rpds-py:           0.10.6
	- rsa:               4.9
	- ruff:              0.0.292
	- s3fs:              2023.9.2
	- s3transfer:        0.6.2
	- scikit-learn:      1.3.1
	- scipy:             1.11.3
	- seaborn:           0.13.0
	- secretstorage:     3.3.3
	- selfies:           2.1.1
	- send2trash:        1.8.2
	- sentry-sdk:        1.32.0
	- setproctitle:      1.3.3
	- setuptools:        68.2.2
	- shellingham:       1.5.3
	- sip:               6.7.12
	- six:               1.16.0
	- smmap:             3.0.5
	- sniffio:           1.3.0
	- soupsieve:         2.5
	- sqlalchemy:        2.0.22
	- stack-data:        0.6.2
	- starlette:         0.22.0
	- starsessions:      1.3.0
	- statsmodels:       0.14.0
	- sympy:             1.12
	- tensorboard:       2.11.2
	- tensorboard-data-server: 0.6.1
	- tensorboard-plugin-wit: 1.8.1
	- terminado:         0.17.1
	- threadpoolctl:     3.2.0
	- tinycss2:          1.2.1
	- toml:              0.10.2
	- tomli:             2.0.1
	- tomlkit:           0.12.1
	- torch:             2.1.0
	- torch-cluster:     1.6.3
	- torch-geometric:   2.4.0
	- torch-scatter:     2.1.2
	- torch-sparse:      0.6.18
	- torchmetrics:      1.2.0
	- tornado:           6.3.3
	- tqdm:              4.66.1
	- traitlets:         5.11.2
	- triton:            2.1.0
	- trove-classifiers: 2023.9.19
	- types-python-dateutil: 2.8.19.14
	- typing-extensions: 4.8.0
	- typing-utils:      0.1.0
	- tzdata:            2023.3
	- ukkonen:           1.0.1
	- uri-template:      1.3.0
	- urllib3:           1.26.17
	- uvicorn:           0.23.2
	- virtualenv:        20.24.4
	- wandb:             0.15.12
	- wcwidth:           0.2.8
	- webcolors:         1.13
	- webencodings:      0.5.1
	- websocket-client:  1.6.4
	- websockets:        11.0.3
	- werkzeug:          3.0.0
	- wheel:             0.41.2
	- widgetsnbextension: 4.0.9
	- wrapt:             1.15.0
	- yarl:              1.9.2
	- zipp:              3.17.0
* System:
	- OS:                Linux
	- architecture:
		- 64bit
		- ELF
	- processor:         x86_64
	- python:            3.11.6
	- release:           5.15.0-1032-gcp
	- version:           #40~20.04.1-Ubuntu SMP Tue Apr 11 02:49:52 UTC 2023


More info

No response

shenoynikhil (Oct 16 '23)

I can confirm that I am also experiencing this bug. Downgrading to 2.0.8 fixes it.

Current env:

lightning                     2.1.0
lightning-cloud               0.5.42
lightning-utilities           0.9.0
pytorch-lightning             2.1.0

torch                         2.1.0+cu118
torchmetrics                  1.2.0
torchvision                   0.16.0+cu118

Stack:

File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 545, in fit
    call._call_and_handle_interrupt(
  File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 102, in launch
    return function(*args, **kwargs)
  File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 581, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 990, in _run
    results = self._run_stage()
  File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 1034, in _run_stage
    self._run_sanity_check()
  File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 1063, in _run_sanity_check
    val_loop.run()
  File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/loops/utilities.py", line 181, in _decorator
    return loop_run(self, *args, **kwargs)
  File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 141, in run
    return self.on_run_end()
  File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 253, in on_run_end
    self._on_evaluation_epoch_end()
  File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 331, in _on_evaluation_epoch_end
    trainer._logger_connector.on_epoch_end()
  File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py", line 187, in on_epoch_end
Traceback (most recent call last):
    metrics = self.metrics
  File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py", line 226, in metrics
    return self.trainer._results.metrics(on_step)
  File "/mnt/hdd1/users/Documents/dev/sandbox_dsm_scripts/train/train.py", line 91, in <module>
  File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py", line 471, in metrics
    value = self._get_cache(result_metric, on_step)
    pytroch()
  File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py", line 435, in _get_cache
  File "/mnt/hdd1/users/Documents/dev/sandbox_dsm_scripts/train/train.py", line 73, in pytroch
    result_metric.compute()
  File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py", line 280, in wrapped_func
    trainer.train_cls_locally(
  File "/mnt/hdd1/users/Documents/dev/trainer.py", line 927, in train_cls_locally
    self._computed = compute(*args, **kwargs)
  File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py", line 245, in compute
    cumulated_batch_size = self.meta.sync(self.cumulated_batch_size)
    trainer.fit(model, datamodule=datamodule)
  File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/strategies/ddp.py", line 330, in reduce
  File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 545, in fit
    return _sync_ddp_if_available(tensor, group, reduce_op=reduce_op)
  File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/fabric/utilities/distributed.py", line 171, in _sync_ddp_if_available
    call._call_and_handle_interrupt(
    return _sync_ddp(result, group=group, reduce_op=reduce_op)
  File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
  File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/fabric/utilities/distributed.py", line 221, in _sync_ddp
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
    torch.distributed.all_reduce(result, op=op, group=group, async_op=False)
  File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1536, in all_reduce
  File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 102, in launch
    return function(*args, **kwargs)
    work = group.allreduce([tensor], opts)
  File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 581, in _fit_impl
RuntimeError: Tensors must be CUDA and dense
    self._run(model, ckpt_path=ckpt_path)
  File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 990, in _run
    results = self._run_stage()
  File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 1034, in _run_stage
    self._run_sanity_check()
  File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 1063, in _run_sanity_check
    val_loop.run()
  File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/loops/utilities.py", line 181, in _decorator
    return loop_run(self, *args, **kwargs)
  File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 141, in run
    return self.on_run_end()
  File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 253, in on_run_end
    self._on_evaluation_epoch_end()
  File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 331, in _on_evaluation_epoch_end
    trainer._logger_connector.on_epoch_end()
  File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py", line 187, in on_epoch_end
    metrics = self.metrics
  File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py", line 226, in metrics
    return self.trainer._results.metrics(on_step)
  File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py", line 471, in metrics
    value = self._get_cache(result_metric, on_step)
  File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py", line 435, in _get_cache
    result_metric.compute()
  File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py", line 280, in wrapped_func
    self._computed = compute(*args, **kwargs)
  File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py", line 245, in compute
    cumulated_batch_size = self.meta.sync(self.cumulated_batch_size)
  File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/pytorch/strategies/ddp.py", line 330, in reduce
    return _sync_ddp_if_available(tensor, group, reduce_op=reduce_op)
  File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/fabric/utilities/distributed.py", line 171, in _sync_ddp_if_available
    return _sync_ddp(result, group=group, reduce_op=reduce_op)
  File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/lightning/fabric/utilities/distributed.py", line 221, in _sync_ddp
    torch.distributed.all_reduce(result, op=op, group=group, async_op=False)
  File "/home/miniconda3/envs/pt-lght/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1536, in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: Tensors must be CUDA and dense

andwaal-esmart (Oct 17 '23)

I see this as well

kleinhenz (Oct 18 '23)

I started seeing this error but couldn't figure out what caused it. It appears after the first validation epoch, apparently when computing a metric in the on_epoch_end callback. Downgrading to 2.0.8 helped.

senarvi (Oct 19 '23)

Same bug here after upgrading to torch==2.1.0 and lightning==2.1.0.

This bug appeared when running Metric.compute() on a torchmetric after a validation epoch.

Edit: I am using lightning fabric instead of the lightning trainer. The bug is also triggered there.

pableeto (Oct 20 '23)

Same for me. Downgrading to pytorch-lightning==2.0.8 fixed the issue.

fakufaku (Oct 25 '23)

I've got the same error on torch==2.1.0 and lightning==2.1.0, and it was fixed by downgrading to pytorch_lightning==2.0.8.

samils7 (Oct 25 '23)

I also just ran across this error. It seems like the self.log(key, val) calls have changed in some way; in my case, the error went away when I manually moved val to the GPU in every call to self.log in my code.

emannix (Oct 27 '23)

My feeling is that the DDP strategy in lightning==2.0.8 initialized distributed backends for both CPU and GPU when running with device=GPU. Below is a minimal example that works with 2.0.8, but crashes in 2.1.0:

import torch
from lightning import Trainer, LightningModule
from torch.utils.data import DataLoader


class LitModel(LightningModule):
    def __init__(self) -> None:
        super().__init__()
        self.layer = torch.nn.Linear(1, 1)

    def training_step(self, x):
        # Everything but the next line is just dummy-code to make it run
        self.log(
            "foo", value=torch.zeros(1, device="cpu"), on_step=True, sync_dist=True
        )
        loss = self.layer(x).mean()
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

    def train_dataloader(self):
        return DataLoader(torch.randn(32, 1), batch_size=1)


def main():
    model = LitModel()
    trainer = Trainer(devices=2, accelerator="gpu", max_epochs=2)
    trainer.fit(model)


if __name__ == "__main__":
    main()

Note that this isn't restricted to distributed code that's run by lightning. We have some functionality that uses torch.distributed directly, and we run into the exact same issue when we try to broadcast non-CUDA tensors.
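
For reference, torch.distributed itself lets you register one backend per device type at init time, which is presumably the kind of setup that made CPU collectives work in older versions. A minimal, untested sketch (assuming a launch like `torchrun --nproc_per_node=2 script.py`, which sets the usual RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT environment variables):

import torch
import torch.distributed as dist

# Register a backend per device type: Gloo handles CPU tensors, NCCL handles CUDA tensors.
dist.init_process_group(backend="cpu:gloo,cuda:nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# With both backends registered, collectives on CPU tensors no longer fail with
# "No backend type associated with device type cpu".
cpu_tensor = torch.zeros(1)
dist.all_reduce(cpu_tensor)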

dsuess (Oct 27 '23)

Has this issue been addressed in nightly? I was really trying to stick to either pip or conda versions and it looks like 2.0.8 is not available on either.

egoetz (Nov 21 '23)

Same issue with PyTorch 2.1.1 and Lightning 2.1.2

celpas (Nov 23 '23)

It looks like the change comes from this PR: #17334 (found by git-bisecting with the code sample by @dsuess).

awaelchli (Nov 27 '23)

It looks like the change was intentional. The changelog says:

self.log'ed tensors are now kept in the original device to reduce unnecessary host-to-device synchronizations (#17334)

This means that if you pass in a tensor, it already needs to be on the right device; the user has to perform the .to() call explicitly.
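
In code, that amounts to something like the following (a minimal sketch, assuming a metric value that was computed on the CPU and is logged with sync_dist=True under an NCCL-backed DDP run):

import torch
from lightning import LightningModule


class MyModule(LightningModule):
    def on_validation_epoch_end(self) -> None:
        value = torch.tensor(0.123)  # hypothetical metric computed on the CPU
        # Since 2.1, self.log keeps the tensor on its original device, so move it to the
        # module's device explicitly before the reduction triggered by sync_dist=True.
        self.log("my_metric", value.to(self.device), sync_dist=True)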

cc @carmocca

awaelchli (Nov 27 '23)

The resolution is not clear to me. I'm getting the message "RuntimeError: No backend type associated with device type cpu". If I were logging 20 things, some of them on CPU and some on GPU, what should I be doing? From your comment @awaelchli I would've thought adding .to('cpu') calls, but the error message makes me think the opposite (though moving CPU results back to GPU also seems silly).

RuABraun (Dec 04 '23)

If I understood correctly, when using self.log(..., sync_dist=True) with DDP, you have to transfer the tensor to the GPU before logging.

Is it possible to move the tensors to the correct device automatically in LightningModule.log()? If not, I feel like this should be mentioned in the documentation, and it would be good to give a better error message. Currently the 15-minute Lightning tutorial instructs you to remove any .cuda() or device calls, because LightningModules are hardware-agnostic.

senarvi (Dec 04 '23)

@awaelchli Thanks for clarifying. I've found another corner case where the new behaviour breaks existing code: If you re-use a trainer instance multiple times (e.g. for evaluating multiple epochs), you can end up with metrics moved to CPU even if you log them with GPU tensors.

The reason is that the logger connector moves all intermediate results to CPU on teardown. So on the second call to trainer.validate, the helper state (e.g. cumulated_batch_size) of the cached results is on CPU. This can be fixed by removing all cached results through

trainer.validate_loop._results.clear()

Here's a full example to reproduce this:

import torch
from lightning import Trainer, LightningModule
from torch.utils.data import DataLoader


class LitModel(LightningModule):
    def __init__(self) -> None:
        super().__init__()
        self.layer = torch.nn.Linear(1, 1)

    def training_step(self, x):
        loss = self.layer(x).mean()
        return loss

    def validation_step(self, *args, **kwargs):
        self.log(
            "foo", value=torch.zeros(1, device=self.device), on_step=True, sync_dist=True
        )
        return super().validation_step(*args, **kwargs)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

    def val_dataloader(self):
        return DataLoader(torch.randn(32, 1), batch_size=1)


def main():
    model = LitModel()
    trainer = Trainer(devices=2, accelerator="gpu", max_epochs=2)
    trainer.validate(model)
    # Uncomment the following line to fix the issue
    #trainer.validate_loop._results.clear()
    trainer.validate(model)


if __name__ == "__main__":
    main()

dsuess (Dec 05 '23)

The reason is that the logger connector moves all intermediate results to CPU on teardown. So on the second call to trainer.validate, the helper state (e.g. cumulated_batch_size) of the cached results is on CPU. This can be fixed by removing all cached results through

trainer.validate_loop._results.clear()

If you want to call trainer.fit twice, the analogous fix is:

trainer.fit_loop.epoch_loop.val_loop._results.clear()

vitusbenson (Dec 30 '23)

I'm having the same issue with the latest version and resolved it by downgrading to lightning==2.0.9.

yirending (Jan 02 '24)

I've solved the issue on lightning==2.1.3. When overriding any epoch_end hook, if you log, just make sure that the tensor is on the GPU device. If you initialize a new tensor, initialize it with device=self.device.

ouioui199 (Jan 16 '24)

I've solved the issue on lightning==2.1.3. When overriding any epoch_end hook, if you log, just make sure that the tensor is on the GPU device. If you initialize a new tensor, initialize it with device=self.device.

@ouioui199's suggestion works. I changed my code from

self.log_dict(
    {f"test_map_{label}": value for label, value in zip(self.id2label.values(), mAP_per_class)},
    sync_dist=True,
)

to

self.log_dict(
    {f"test_map_{label}": value.to("cuda") for label, value in zip(self.id2label.values(), mAP_per_class)},
    sync_dist=True,
)

xzklwj (Feb 01 '24)

This is really helpful. This does have something to do with torchmetrics and the DDP processes. I use a Callback for logging purposes and just move the logged value to the device:

import pytorch_lightning as pl
from pytorch_lightning.callbacks import Callback

class AwesomeCallback(Callback):
    def on_validation_epoch_end(self, trainer: pl.Trainer, pl_module: pl.LightningModule):
        pl_module.log("some metrics", some_value.to(pl_module.device), sync_dist=True)  # some_value computed elsewhere

haritsahm (Mar 22 '24)

This error can also be reproduced with torchmetrics 1.3.2 when storing lists of tensors on CPU via compute_on_cpu=True.
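
A hedged repro sketch of that kind of setup (not from the comment above; it assumes a two-GPU launch via `torchrun --nproc_per_node=2 repro.py` and uses BinaryAUROC, whose preds/target states are lists of tensors):

import torch
import torch.distributed as dist
import torchmetrics

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

# compute_on_cpu=True moves the metric's list states to the CPU after each update.
metric = torchmetrics.classification.BinaryAUROC(compute_on_cpu=True).to(rank)
metric.update(torch.rand(8, device=rank), torch.randint(0, 2, (8,), device=rank))

# compute() then synchronizes the CPU-resident states through the NCCL-only group,
# which reportedly raises "No backend type associated with device type cpu".
metric.compute()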

rballeba (Mar 27 '24)

Got the same error when trying to compute a confusion matrix in a callback when I call metric.plot():

def on_test_epoch_end(self) -> None:
    metric = MulticlassConfusionMatrix(num_classes=self.num_classes).to("cpu")

    outputs = torch.cat(self.x_test, dim=0).to("cpu")
    labels = torch.cat(self.y_test, dim=0).to("cpu")
    outputs = torch.softmax(outputs, dim=1).argmax(dim=1)
    metric.update(outputs, labels)
    pl = ["Latin", "Russian", "Arabic", "Chinese"]
    fig_, ax_ = metric.plot(labels=pl)
    fig_.savefig("test.png")

asusdisciple (May 14 '24)

Same bug here after upgrading to torch==2.1.0 and lightning==2.1.0.

This bug appeared when running Metric.compute() on a torchmetric after a validation epoch.

Edit: I am using lightning fabric instead of the lightning trainer. The bug is also triggered there.

For me, I also saw this on Metric.compute(). It happened when I was running integration tests where one test used a DDPStrategy and the other used a single-process strategy on the CPU. Once a distributed process group has been created, an error seems to be raised if a metric is computed on the CPU.

import lightning
import torch
import torchmetrics
from torch import nn

fabric = lightning.Fabric(accelerator="cuda", devices=2)
fabric.launch()
module = nn.Linear(2, 1)
module = fabric.setup(module)

metric = torchmetrics.Accuracy(task="multiclass", num_classes=2)
metric.update(torch.tensor([0., 1.]), torch.tensor([0, 1]))
metric.compute()
RuntimeError:
No backend type associated with device type cpu

It seems to happen because torchmetrics uses torch.distributed.group.WORLD as the process group for CPU metrics.
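
If that is the cause, one possible workaround (an untested sketch rather than an official recommendation) is to create a Gloo-backed group, which supports CPU tensors, and hand it to the metric through its process_group argument:

import torch.distributed as dist
import torchmetrics

# Assumes the default process group has already been initialized (e.g. by Fabric/Trainer).
cpu_group = dist.new_group(backend="gloo")
metric = torchmetrics.Accuracy(task="multiclass", num_classes=2, process_group=cpu_group)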

ringohoffman (Jul 03 '24)