pytorch-lightning
RuntimeError: Bad StatusOr access: UNKNOWN: TPU initialization failed: Invalid --2a886c8_slice_builder_worker_addresses specified. Expected 4 worker addresses, got 1.
Bug description
I am trying to use a TPU on Kaggle and get the error "RuntimeError: Bad StatusOr access: UNKNOWN: TPU initialization failed: Invalid --2a886c8_slice_builder_worker_addresses specified. Expected 4 worker addresses, got 1."
I am using 8 TPU cores. Here is my Trainer:
trainer = Trainer(
    max_epochs=50,
    accelerator="tpu",
    devices=8,
    callbacks=[pl.callbacks.EarlyStopping(monitor='val_loss', patience=2)],
)
I am new to machine learning, so please tell me if I have made a mistake.
What version are you seeing the problem on?
v2.4
How to reproduce the bug
No response
Error messages and logs
WARNING: Logging before InitGoogle() is written to STDERR
E0000 00:00:1725383433.302361 2870 common_lib.cc:818] Could not set metric server port: INVALID_ARGUMENT: Could not find SliceBuilder port 8476 in any of the 0 ports provided in `tpu_process_addresses`="local"
=== Source Location Trace: ===
learning/45eac/tfrc/runtime/common_lib.cc:483
WARNING: Logging before InitGoogle() is written to STDERR
E0000 00:00:1725383433.407367 2874 common_lib.cc:818] Could not set metric server port: INVALID_ARGUMENT: Could not find SliceBuilder port 8477 in any of the 0 ports provided in `tpu_process_addresses`="local"
=== Source Location Trace: ===
learning/45eac/tfrc/runtime/common_lib.cc:483
WARNING: Logging before InitGoogle() is written to STDERR
E0000 00:00:1725383433.442340 2878 common_lib.cc:818] Could not set metric server port: INVALID_ARGUMENT: Could not find SliceBuilder port 8478 in any of the 0 ports provided in `tpu_process_addresses`="local"
=== Source Location Trace: ===
learning/45eac/tfrc/runtime/common_lib.cc:483
WARNING: Logging before InitGoogle() is written to STDERR
E0000 00:00:1725383433.453311 2882 common_lib.cc:818] Could not set metric server port: INVALID_ARGUMENT: Could not find SliceBuilder port 8479 in any of the 0 ports provided in `tpu_process_addresses`="local"
=== Source Location Trace: ===
learning/45eac/tfrc/runtime/common_lib.cc:483
---------------------------------------------------------------------------
_RemoteTraceback Traceback (most recent call last)
_RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 205, in _process_chunk
return [fn(*args) for args in chunk]
File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 205, in <listcomp>
return [fn(*args) for args in chunk]
File "/usr/local/lib/python3.10/site-packages/torch_xla/runtime.py", line 95, in wrapper
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/torch_xla/_internal/pjrt.py", line 59, in _run_thread_per_device
initializer_fn(local_rank, local_world_size)
File "/usr/local/lib/python3.10/site-packages/torch_xla/runtime.py", line 95, in wrapper
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/torch_xla/_internal/pjrt.py", line 125, in initialize_multiprocess
devices = xm.get_xla_supported_devices()
File "/usr/local/lib/python3.10/site-packages/torch_xla/core/xla_model.py", line 99, in get_xla_supported_devices
devices = torch_xla._XLAC._xla_get_devices()
RuntimeError: Bad StatusOr access: UNKNOWN: TPU initialization failed: Invalid --2a886c8_slice_builder_worker_addresses specified. Expected 4 worker addresses, got 1.
"""
The above exception was the direct cause of the following exception:
RuntimeError Traceback (most recent call last)
Cell In[47], line 12
1 model = ToxicCommentModel(
2 input_size=hyperparameters["input_size"],
3 hidden_size=hyperparameters["linear_hidden_size"],
(...)
10 max_len=hyperparameters["context_length"]
11 )
---> 12 trainer.fit(model, data_module)
File /usr/local/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py:538, in Trainer.fit(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
536 self.state.status = TrainerStatus.RUNNING
537 self.training = True
--> 538 call._call_and_handle_interrupt(
539 self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
540 )
File /usr/local/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py:46, in _call_and_handle_interrupt(trainer, trainer_fn, *args, **kwargs)
44 try:
45 if trainer.strategy.launcher is not None:
---> 46 return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
47 return trainer_fn(*args, **kwargs)
49 except _TunerExitException:
File /usr/local/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/xla.py:98, in _XLALauncher.launch(self, function, trainer, *args, **kwargs)
93 if nprocs == 1:
94 # avoid warning: "Unsupported nprocs". If it's 1, it will call the launched function directly.
95 # otherwise it will use all devices
96 spawn_kwargs["nprocs"] = nprocs
---> 98 process_context = xmp.spawn(
99 self._wrapping_function,
100 args=(trainer, function, args, kwargs, return_queue),
101 start_method=self._start_method,
102 join=False, # we will join ourselves to get the process references
103 **spawn_kwargs,
104 )
105 # xla will not actually create processes if only 1 device
106 if process_context is not None:
File /usr/local/lib/python3.10/site-packages/torch_xla/runtime.py:95, in requires_pjrt.<locals>.wrapper(*args, **kwargs)
91 if not using_pjrt():
92 raise NotImplementedError('`{}` not implemented for XRT'.format(
93 fn.__name__))
---> 95 return fn(*args, **kwargs)
File /usr/local/lib/python3.10/site-packages/torch_xla/distributed/xla_multiprocessing.py:38, in spawn(fn, args, nprocs, join, daemon, start_method)
6 @xr.requires_pjrt
7 def spawn(fn,
8 args=(),
(...)
11 daemon=False,
12 start_method='spawn'):
13 """Enables multi processing based replication.
14
15 Args:
(...)
36 return None.
37 """
---> 38 return pjrt.spawn(fn, nprocs, start_method, args)
File /usr/local/lib/python3.10/site-packages/torch_xla/_internal/pjrt.py:214, in spawn(fn, nprocs, start_method, args)
211 elif nprocs is not None:
212 logging.warning('Unsupported nprocs (%d), ignoring...' % nprocs)
--> 214 run_multiprocess(spawn_fn, start_method=start_method)
File /usr/local/lib/python3.10/site-packages/torch_xla/runtime.py:95, in requires_pjrt.<locals>.wrapper(*args, **kwargs)
91 if not using_pjrt():
92 raise NotImplementedError('`{}` not implemented for XRT'.format(
93 fn.__name__))
---> 95 return fn(*args, **kwargs)
File /usr/local/lib/python3.10/site-packages/torch_xla/_internal/pjrt.py:174, in run_multiprocess(fn, start_method, *args, **kwargs)
168 mp_fn = functools.partial(
169 _run_thread_per_device,
170 local_world_size=num_processes,
171 fn=functools.partial(fn, *args, **kwargs),
172 initializer_fn=initialize_multiprocess)
173 process_results = executor.map(mp_fn, range(num_processes))
--> 174 replica_results = list(
175 itertools.chain.from_iterable(
176 result.items() for result in process_results))
178 return _merge_replica_results(replica_results)
File /usr/local/lib/python3.10/site-packages/torch_xla/_internal/pjrt.py:175, in <genexpr>(.0)
168 mp_fn = functools.partial(
169 _run_thread_per_device,
170 local_world_size=num_processes,
171 fn=functools.partial(fn, *args, **kwargs),
172 initializer_fn=initialize_multiprocess)
173 process_results = executor.map(mp_fn, range(num_processes))
174 replica_results = list(
--> 175 itertools.chain.from_iterable(
176 result.items() for result in process_results))
178 return _merge_replica_results(replica_results)
File /usr/local/lib/python3.10/concurrent/futures/process.py:575, in _chain_from_iterable_of_lists(iterable)
569 def _chain_from_iterable_of_lists(iterable):
570 """
571 Specialized implementation of itertools.chain.from_iterable.
572 Each item in *iterable* should be a list. This function is
573 careful not to keep references to yielded objects.
574 """
--> 575 for element in iterable:
576 element.reverse()
577 while element:
File /usr/local/lib/python3.10/concurrent/futures/_base.py:621, in Executor.map.<locals>.result_iterator()
618 while fs:
619 # Careful not to keep a reference to the popped future
620 if timeout is None:
--> 621 yield _result_or_cancel(fs.pop())
622 else:
623 yield _result_or_cancel(fs.pop(), end_time - time.monotonic())
File /usr/local/lib/python3.10/concurrent/futures/_base.py:319, in _result_or_cancel(***failed resolving arguments***)
317 try:
318 try:
--> 319 return fut.result(timeout)
320 finally:
321 fut.cancel()
File /usr/local/lib/python3.10/concurrent/futures/_base.py:458, in Future.result(self, timeout)
456 raise CancelledError()
457 elif self._state == FINISHED:
--> 458 return self.__get_result()
459 else:
460 raise TimeoutError()
File /usr/local/lib/python3.10/concurrent/futures/_base.py:403, in Future.__get_result(self)
401 if self._exception:
402 try:
--> 403 raise self._exception
404 finally:
405 # Break a reference cycle with the exception in self._exception
406 self = None
RuntimeError: Bad StatusOr access: UNKNOWN: TPU initialization failed: Invalid --2a886c8_slice_builder_worker_addresses specified. Expected 4 worker addresses, got 1.
Environment
Current environment
- CUDA:
- GPU: None
- available: False
- version: 12.1
- Lightning:
- lightning-utilities: 0.11.7
- pytorch-lightning: 2.4.0
- torch: 2.4.0
- torch-xla: 2.4.0+libtpu
- torchaudio: 2.4.0
- torchmetrics: 1.4.1
- torchvision: 0.19.0
- Packages:
- absl-py: 2.1.0
- accelerate: 0.33.0
- aiofiles: 22.1.0
- aiohappyeyeballs: 2.4.0
- aiohttp: 3.10.5
- aiosignal: 1.3.1
- aiosqlite: 0.20.0
- albucore: 0.0.13
- albumentations: 1.4.14
- annotated-types: 0.7.0
- ansicolors: 1.1.8
- anyio: 4.4.0
- argon2-cffi: 23.1.0
- argon2-cffi-bindings: 21.2.0
- array-record: 0.5.1
- arrow: 1.3.0
- astroid: 3.2.4
- asttokens: 2.4.1
- astunparse: 1.6.3
- async-timeout: 4.0.3
- attrs: 24.2.0
- audioread: 3.0.1
- autopep8: 2.0.4
- babel: 2.16.0
- beautifulsoup4: 4.12.3
- bleach: 6.1.0
- blis: 0.7.11
- cachetools: 5.5.0
- catalogue: 2.0.10
- certifi: 2024.7.4
- cffi: 1.17.0
- charset-normalizer: 3.3.2
- chex: 0.1.86
- click: 8.1.7
- cloud-tpu-client: 0.10
- cloudpathlib: 0.19.0
- cloudpickle: 3.0.0
- comm: 0.2.2
- confection: 0.1.5
- contourpy: 1.2.1
- cramjam: 2.8.3
- cycler: 0.12.1
- cymem: 2.0.8
- debugpy: 1.8.5
- decorator: 5.1.1
- defusedxml: 0.7.1
- diffusers: 0.30.0
- dill: 0.3.8
- distrax: 0.1.5
- dm-haiku: 0.0.13.dev0
- dm-tree: 0.1.8
- docstring-parser: 0.16
- docstring-to-markdown: 0.15
- einops: 0.8.0
- en-core-web-sm: 3.7.1
- entrypoints: 0.4
- etils: 1.7.0
- eval-type-backport: 0.2.0
- exceptiongroup: 1.2.2
- executing: 2.0.1
- fastjsonschema: 2.20.0
- fastparquet: 2024.5.0
- filelock: 3.15.4
- flake8: 7.0.0
- flatbuffers: 24.3.25
- flax: 0.8.4
- fonttools: 4.53.1
- fqdn: 1.5.1
- frozenlist: 1.4.1
- fsspec: 2024.6.1
- funcsigs: 1.0.2
- gast: 0.6.0
- gin-config: 0.5.0
- google-api-core: 1.34.1
- google-api-python-client: 1.8.0
- google-auth: 2.34.0
- google-auth-httplib2: 0.2.0
- google-pasta: 0.2.0
- googleapis-common-protos: 1.63.2
- grpcio: 1.65.5
- gym: 0.26.2
- gym-notices: 0.0.8
- h5py: 3.11.0
- httplib2: 0.22.0
- huggingface-hub: 0.24.6
- idna: 3.7
- imageio: 2.35.1
- immutabledict: 4.2.0
- importlib-metadata: 8.3.0
- importlib-resources: 6.4.3
- ipykernel: 6.29.5
- ipython: 8.26.0
- ipython-genutils: 0.2.0
- isoduration: 20.11.0
- isort: 5.13.2
- jax: 0.4.23
- jaxlib: 0.4.23
- jedi: 0.19.1
- jinja2: 3.1.4
- jmp: 0.0.4
- joblib: 1.4.2
- jraph: 0.0.6.dev0
- json5: 0.9.25
- jsonpointer: 3.0.0
- jsonschema: 4.23.0
- jsonschema-specifications: 2023.12.1
- jupyter-client: 7.4.9
- jupyter-core: 5.7.2
- jupyter-events: 0.10.0
- jupyter-lsp: 1.5.1
- jupyter-server: 2.14.2
- jupyter-server-fileid: 0.9.2
- jupyter-server-terminals: 0.5.3
- jupyter-server-ydoc: 0.8.0
- jupyter-ydoc: 0.2.5
- jupyterlab: 3.6.7
- jupyterlab-pygments: 0.3.0
- jupyterlab-server: 2.27.3
- kagglehub: 0.2.9
- keras: 3.5.0
- keras-core: 0.1.7
- keras-cv: 0.9.0
- keras-nlp: 0.14.4
- kiwisolver: 1.4.5
- langcodes: 3.4.0
- language-data: 1.2.0
- lazy-loader: 0.4
- libclang: 18.1.1
- librosa: 0.10.2.post1
- libtpu-nightly: 0.1.dev20231213
- lightning-utilities: 0.11.7
- llvmlite: 0.43.0
- marisa-trie: 1.2.0
- markdown: 3.7
- markdown-it-py: 3.0.0
- markupsafe: 2.1.5
- matplotlib: 3.9.2
- matplotlib-inline: 0.1.7
- mccabe: 0.7.0
- mdurl: 0.1.2
- mistune: 3.0.2
- ml-dtypes: 0.3.2
- mpmath: 1.3.0
- msgpack: 1.0.8
- multidict: 6.0.5
- murmurhash: 1.0.10
- namex: 0.0.8
- nbclassic: 1.1.0
- nbclient: 0.10.0
- nbconvert: 7.16.4
- nbformat: 5.10.4
- nest-asyncio: 1.6.0
- networkx: 3.3
- notebook: 6.5.7
- notebook-shim: 0.2.4
- numba: 0.60.0
- numpy: 1.26.4
- nvidia-cublas-cu12: 12.1.3.1
- nvidia-cuda-cupti-cu12: 12.1.105
- nvidia-cuda-nvrtc-cu12: 12.1.105
- nvidia-cuda-runtime-cu12: 12.1.105
- nvidia-cudnn-cu12: 9.1.0.70
- nvidia-cufft-cu12: 11.0.2.54
- nvidia-curand-cu12: 10.3.2.106
- nvidia-cusolver-cu12: 11.4.5.107
- nvidia-cusparse-cu12: 12.1.0.106
- nvidia-nccl-cu12: 2.20.5
- nvidia-nvjitlink-cu12: 12.6.20
- nvidia-nvtx-cu12: 12.1.105
- oauth2client: 4.1.3
- opencv-python: 4.10.0.84
- opencv-python-headless: 4.10.0.84
- opt-einsum: 3.3.0
- optax: 0.2.2
- optree: 0.12.1
- orbax-checkpoint: 0.5.16
- overrides: 7.7.0
- packaging: 24.1
- pandas: 2.2.2
- pandocfilters: 1.5.1
- papermill: 2.6.0
- parso: 0.8.4
- pexpect: 4.9.0
- pillow: 10.4.0
- pip: 23.0.1
- platformdirs: 4.2.2
- pluggy: 1.5.0
- pooch: 1.8.2
- preshed: 3.0.9
- prometheus-client: 0.20.0
- promise: 2.3
- prompt-toolkit: 3.0.47
- protobuf: 3.20.3
- psutil: 6.0.0
- ptyprocess: 0.7.0
- pure-eval: 0.2.3
- pyarrow: 17.0.0
- pyasn1: 0.6.0
- pyasn1-modules: 0.4.0
- pycodestyle: 2.11.1
- pycparser: 2.22
- pydantic: 2.8.2
- pydantic-core: 2.20.1
- pydocstyle: 6.3.0
- pyflakes: 3.2.0
- pygments: 2.18.0
- pylint: 3.2.6
- pyparsing: 3.1.2
- python-dateutil: 2.9.0.post0
- python-json-logger: 2.0.7
- python-lsp-jsonrpc: 1.1.2
- python-lsp-server: 1.11.0
- pytoolconfig: 1.3.1
- pytorch-lightning: 2.4.0
- pytz: 2024.1
- pyyaml: 6.0.2
- pyzmq: 26.1.1
- referencing: 0.35.1
- regex: 2024.7.24
- requests: 2.32.3
- rfc3339-validator: 0.1.4
- rfc3986-validator: 0.1.1
- rich: 13.7.1
- rope: 1.13.0
- rpds-py: 0.20.0
- rsa: 4.9
- safetensors: 0.4.4
- scikit-image: 0.24.0
- scikit-learn: 1.5.1
- scipy: 1.14.0
- seaborn: 0.13.2
- send2trash: 1.8.3
- setuptools: 65.5.1
- shellingham: 1.5.4
- simple-parsing: 0.1.5
- six: 1.16.0
- smart-open: 7.0.4
- sniffio: 1.3.1
- snowballstemmer: 2.2.0
- soundfile: 0.12.1
- soupsieve: 2.6
- soxr: 0.4.0
- spacy: 3.7.6
- spacy-legacy: 3.0.12
- spacy-loggers: 1.0.5
- srsly: 2.4.8
- stack-data: 0.6.3
- sympy: 1.13.2
- tabulate: 0.9.0
- tenacity: 9.0.0
- tensorboard: 2.17.1
- tensorboard-data-server: 0.7.2
- tensorflow-cpu: 2.17.0
- tensorflow-datasets: 4.9.6
- tensorflow-hub: 0.16.1
- tensorflow-io: 0.37.1
- tensorflow-io-gcs-filesystem: 0.37.1
- tensorflow-metadata: 1.15.0
- tensorflow-probability: 0.24.0
- tensorflow-text: 2.16.1
- tensorstore: 0.1.64
- termcolor: 2.4.0
- terminado: 0.18.1
- tf-keras: 2.16.0
- thinc: 8.2.5
- threadpoolctl: 3.5.0
- tifffile: 2024.8.10
- timm: 1.0.8
- tinycss2: 1.3.0
- tokenizers: 0.19.1
- toml: 0.10.2
- tomli: 2.0.1
- tomlkit: 0.13.2
- toolz: 0.12.1
- torch: 2.4.0
- torch-xla: 2.4.0+libtpu
- torchaudio: 2.4.0
- torchmetrics: 1.4.1
- torchvision: 0.19.0
- tornado: 6.4.1
- tqdm: 4.66.5
- traitlets: 5.14.3
- transformers: 4.44.0
- trax: 1.4.1
- triton: 3.0.0
- typer: 0.12.5
- types-python-dateutil: 2.9.0.20240316
- typing-extensions: 4.12.2
- tzdata: 2024.1
- ujson: 5.10.0
- uri-template: 1.3.0
- uritemplate: 3.0.1
- urllib3: 2.2.2
- wasabi: 1.1.3
- wcwidth: 0.2.13
- weasel: 0.4.1
- webcolors: 24.8.0
- webencodings: 0.5.1
- websocket-client: 1.8.0
- werkzeug: 3.0.3
- whatthepatch: 1.0.6
- wheel: 0.44.0
- wrapt: 1.16.0
- y-py: 0.6.2
- yapf: 0.40.2
- yarl: 1.9.7
- ypy-websocket: 0.8.4
- zipp: 3.20.0
- System:
- OS: Linux
- architecture:
- 64bit
- ELF
- processor:
- python: 3.10.14
- release: 6.1.42+
- version: #1 SMP PREEMPT_DYNAMIC Sun Oct 8 14:23:56 UTC 2023
More info
No response
anyone?
This is not a Lightning bug. I had the exact same error on a Kaggle TPU v3-8 and found the fix in the Kaggle product feedback discussion. Here is the link: https://www.kaggle.com/discussions/product-feedback/473974 tl;dr: remove the offending environment variable with os.environ.pop('TPU_PROCESS_ADDRESSES')
Thanks @ibinti
import os
os.environ.pop('TPU_PROCESS_ADDRESSES')
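A slightly more defensive variant of the snippet above, offered as a sketch rather than an official fix: a bare `os.environ.pop(...)` raises `KeyError` when the variable is not set, for example if the notebook cell is run a second time. The helper name `clear_tpu_process_addresses` is made up for illustration. It is assumed here that this should run before `trainer.fit(...)`, since the traceback shows the failure in the worker processes that are spawned at that point.

```python
import os


def clear_tpu_process_addresses(env=os.environ):
    """Remove TPU_PROCESS_ADDRESSES from env if it is set.

    Returns the removed value, or None if the variable was not present.
    (Hypothetical helper; the name is not part of any library.)
    """
    # Passing a default to pop() avoids a KeyError when the key is missing.
    return env.pop("TPU_PROCESS_ADDRESSES", None)


# Assumption: call this once in the notebook before creating the Trainer.
removed = clear_tpu_process_addresses()
print(f"Removed TPU_PROCESS_ADDRESSES={removed!r}")
```

Running the cell again is harmless: the second call finds nothing to remove and returns `None`.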
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!
Closing the issue, as this is not a Lightning problem but one related to Kaggle. Thanks for providing a solution @ibinti.