pytorch-lightning RuntimeError: Bad StatusOr access: UNKNOWN: TPU initialization failed: Invalid --2a886c8_slice_builder_worker_addresses specified. Expected 4 worker addresses, got 1.

Bug description

I am trying to use a TPU on Kaggle and get the error "RuntimeError: Bad StatusOr access: UNKNOWN: TPU initialization failed: Invalid --2a886c8_slice_builder_worker_addresses specified. Expected 4 worker addresses, got 1."

I am using 8 TPU cores. Here is my Trainer:

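import pytorch_lightning as pl
from pytorch_lightning import Trainer

trainer = Trainer(
    max_epochs=50,
    accelerator="tpu",
    devices=8,
    # Stop early if val_loss has not improved for 2 validation checks
    callbacks=[pl.callbacks.EarlyStopping(monitor='val_loss', patience=2)]
)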

I am new to machine learning, so please tell me if I am making a mistake.

What version are you seeing the problem on?

v2.4

How to reproduce the bug

No response

Error messages and logs

WARNING: Logging before InitGoogle() is written to STDERR
E0000 00:00:1725383433.302361    2870 common_lib.cc:818] Could not set metric server port: INVALID_ARGUMENT: Could not find SliceBuilder port 8476 in any of the 0 ports provided in `tpu_process_addresses`="local"
=== Source Location Trace: ===
learning/45eac/tfrc/runtime/common_lib.cc:483
WARNING: Logging before InitGoogle() is written to STDERR
E0000 00:00:1725383433.407367    2874 common_lib.cc:818] Could not set metric server port: INVALID_ARGUMENT: Could not find SliceBuilder port 8477 in any of the 0 ports provided in `tpu_process_addresses`="local"
=== Source Location Trace: === 
learning/45eac/tfrc/runtime/common_lib.cc:483
WARNING: Logging before InitGoogle() is written to STDERR
E0000 00:00:1725383433.442340    2878 common_lib.cc:818] Could not set metric server port: INVALID_ARGUMENT: Could not find SliceBuilder port 8478 in any of the 0 ports provided in `tpu_process_addresses`="local"
=== Source Location Trace: ===
learning/45eac/tfrc/runtime/common_lib.cc:483
WARNING: Logging before InitGoogle() is written to STDERR
E0000 00:00:1725383433.453311    2882 common_lib.cc:818] Could not set metric server port: INVALID_ARGUMENT: Could not find SliceBuilder port 8479 in any of the 0 ports provided in `tpu_process_addresses`="local"
=== Source Location Trace: === 
learning/45eac/tfrc/runtime/common_lib.cc:483
---------------------------------------------------------------------------
_RemoteTraceback                          Traceback (most recent call last)
_RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 205, in _process_chunk
    return [fn(*args) for args in chunk]
  File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 205, in <listcomp>
    return [fn(*args) for args in chunk]
  File "/usr/local/lib/python3.10/site-packages/torch_xla/runtime.py", line 95, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch_xla/_internal/pjrt.py", line 59, in _run_thread_per_device
    initializer_fn(local_rank, local_world_size)
  File "/usr/local/lib/python3.10/site-packages/torch_xla/runtime.py", line 95, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch_xla/_internal/pjrt.py", line 125, in initialize_multiprocess
    devices = xm.get_xla_supported_devices()
  File "/usr/local/lib/python3.10/site-packages/torch_xla/core/xla_model.py", line 99, in get_xla_supported_devices
    devices = torch_xla._XLAC._xla_get_devices()
RuntimeError: Bad StatusOr access: UNKNOWN: TPU initialization failed: Invalid --2a886c8_slice_builder_worker_addresses specified. Expected 4 worker addresses, got 1.
"""

The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)
Cell In[47], line 12
      1 model = ToxicCommentModel(
      2     input_size=hyperparameters["input_size"], 
      3     hidden_size=hyperparameters["linear_hidden_size"],  
   (...)
     10     max_len=hyperparameters["context_length"]
     11 )
---> 12 trainer.fit(model, data_module)

File /usr/local/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py:538, in Trainer.fit(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
    536 self.state.status = TrainerStatus.RUNNING
    537 self.training = True
--> 538 call._call_and_handle_interrupt(
    539     self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
    540 )

File /usr/local/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py:46, in _call_and_handle_interrupt(trainer, trainer_fn, *args, **kwargs)
     44 try:
     45     if trainer.strategy.launcher is not None:
---> 46         return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
     47     return trainer_fn(*args, **kwargs)
     49 except _TunerExitException:

File /usr/local/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/xla.py:98, in _XLALauncher.launch(self, function, trainer, *args, **kwargs)
     93 if nprocs == 1:
     94     # avoid warning: "Unsupported nprocs". If it's 1, it will call the launched function directly.
     95     # otherwise it will use all devices
     96     spawn_kwargs["nprocs"] = nprocs
---> 98 process_context = xmp.spawn(
     99     self._wrapping_function,
    100     args=(trainer, function, args, kwargs, return_queue),
    101     start_method=self._start_method,
    102     join=False,  # we will join ourselves to get the process references
    103     **spawn_kwargs,
    104 )
    105 # xla will not actually create processes if only 1 device
    106 if process_context is not None:

File /usr/local/lib/python3.10/site-packages/torch_xla/runtime.py:95, in requires_pjrt.<locals>.wrapper(*args, **kwargs)
     91 if not using_pjrt():
     92   raise NotImplementedError('`{}` not implemented for XRT'.format(
     93       fn.__name__))
---> 95 return fn(*args, **kwargs)

File /usr/local/lib/python3.10/site-packages/torch_xla/distributed/xla_multiprocessing.py:38, in spawn(fn, args, nprocs, join, daemon, start_method)
      6 @xr.requires_pjrt
      7 def spawn(fn,
      8           args=(),
   (...)
     11           daemon=False,
     12           start_method='spawn'):
     13   """Enables multi processing based replication.
     14 
     15   Args:
   (...)
     36     return None.
     37   """
---> 38   return pjrt.spawn(fn, nprocs, start_method, args)

File /usr/local/lib/python3.10/site-packages/torch_xla/_internal/pjrt.py:214, in spawn(fn, nprocs, start_method, args)
    211 elif nprocs is not None:
    212   logging.warning('Unsupported nprocs (%d), ignoring...' % nprocs)
--> 214 run_multiprocess(spawn_fn, start_method=start_method)

File /usr/local/lib/python3.10/site-packages/torch_xla/runtime.py:95, in requires_pjrt.<locals>.wrapper(*args, **kwargs)
     91 if not using_pjrt():
     92   raise NotImplementedError('`{}` not implemented for XRT'.format(
     93       fn.__name__))
---> 95 return fn(*args, **kwargs)

File /usr/local/lib/python3.10/site-packages/torch_xla/_internal/pjrt.py:174, in run_multiprocess(fn, start_method, *args, **kwargs)
    168   mp_fn = functools.partial(
    169       _run_thread_per_device,
    170       local_world_size=num_processes,
    171       fn=functools.partial(fn, *args, **kwargs),
    172       initializer_fn=initialize_multiprocess)
    173   process_results = executor.map(mp_fn, range(num_processes))
--> 174   replica_results = list(
    175       itertools.chain.from_iterable(
    176           result.items() for result in process_results))
    178 return _merge_replica_results(replica_results)

File /usr/local/lib/python3.10/site-packages/torch_xla/_internal/pjrt.py:175, in <genexpr>(.0)
    168   mp_fn = functools.partial(
    169       _run_thread_per_device,
    170       local_world_size=num_processes,
    171       fn=functools.partial(fn, *args, **kwargs),
    172       initializer_fn=initialize_multiprocess)
    173   process_results = executor.map(mp_fn, range(num_processes))
    174   replica_results = list(
--> 175       itertools.chain.from_iterable(
    176           result.items() for result in process_results))
    178 return _merge_replica_results(replica_results)

File /usr/local/lib/python3.10/concurrent/futures/process.py:575, in _chain_from_iterable_of_lists(iterable)
    569 def _chain_from_iterable_of_lists(iterable):
    570     """
    571     Specialized implementation of itertools.chain.from_iterable.
    572     Each item in *iterable* should be a list.  This function is
    573     careful not to keep references to yielded objects.
    574     """
--> 575     for element in iterable:
    576         element.reverse()
    577         while element:

File /usr/local/lib/python3.10/concurrent/futures/_base.py:621, in Executor.map.<locals>.result_iterator()
    618 while fs:
    619     # Careful not to keep a reference to the popped future
    620     if timeout is None:
--> 621         yield _result_or_cancel(fs.pop())
    622     else:
    623         yield _result_or_cancel(fs.pop(), end_time - time.monotonic())

File /usr/local/lib/python3.10/concurrent/futures/_base.py:319, in _result_or_cancel(***failed resolving arguments***)
    317 try:
    318     try:
--> 319         return fut.result(timeout)
    320     finally:
    321         fut.cancel()

File /usr/local/lib/python3.10/concurrent/futures/_base.py:458, in Future.result(self, timeout)
    456     raise CancelledError()
    457 elif self._state == FINISHED:
--> 458     return self.__get_result()
    459 else:
    460     raise TimeoutError()

File /usr/local/lib/python3.10/concurrent/futures/_base.py:403, in Future.__get_result(self)
    401 if self._exception:
    402     try:
--> 403         raise self._exception
    404     finally:
    405         # Break a reference cycle with the exception in self._exception
    406         self = None

RuntimeError: Bad StatusOr access: UNKNOWN: TPU initialization failed: Invalid --2a886c8_slice_builder_worker_addresses specified. Expected 4 worker addresses, got 1.

Environment

Current environment

CUDA:
- GPU: None
- available: False
- version: 12.1
Lightning:
- lightning-utilities: 0.11.7
- pytorch-lightning: 2.4.0
- torch: 2.4.0
- torch-xla: 2.4.0+libtpu
- torchaudio: 2.4.0
- torchmetrics: 1.4.1
- torchvision: 0.19.0
Packages:
- absl-py: 2.1.0
- accelerate: 0.33.0
- aiofiles: 22.1.0
- aiohappyeyeballs: 2.4.0
- aiohttp: 3.10.5
- aiosignal: 1.3.1
- aiosqlite: 0.20.0
- albucore: 0.0.13
- albumentations: 1.4.14
- annotated-types: 0.7.0
- ansicolors: 1.1.8
- anyio: 4.4.0
- argon2-cffi: 23.1.0
- argon2-cffi-bindings: 21.2.0
- array-record: 0.5.1
- arrow: 1.3.0
- astroid: 3.2.4
- asttokens: 2.4.1
- astunparse: 1.6.3
- async-timeout: 4.0.3
- attrs: 24.2.0
- audioread: 3.0.1
- autopep8: 2.0.4
- babel: 2.16.0
- beautifulsoup4: 4.12.3
- bleach: 6.1.0
- blis: 0.7.11
- cachetools: 5.5.0
- catalogue: 2.0.10
- certifi: 2024.7.4
- cffi: 1.17.0
- charset-normalizer: 3.3.2
- chex: 0.1.86
- click: 8.1.7
- cloud-tpu-client: 0.10
- cloudpathlib: 0.19.0
- cloudpickle: 3.0.0
- comm: 0.2.2
- confection: 0.1.5
- contourpy: 1.2.1
- cramjam: 2.8.3
- cycler: 0.12.1
- cymem: 2.0.8
- debugpy: 1.8.5
- decorator: 5.1.1
- defusedxml: 0.7.1
- diffusers: 0.30.0
- dill: 0.3.8
- distrax: 0.1.5
- dm-haiku: 0.0.13.dev0
- dm-tree: 0.1.8
- docstring-parser: 0.16
- docstring-to-markdown: 0.15
- einops: 0.8.0
- en-core-web-sm: 3.7.1
- entrypoints: 0.4
- etils: 1.7.0
- eval-type-backport: 0.2.0
- exceptiongroup: 1.2.2
- executing: 2.0.1
- fastjsonschema: 2.20.0
- fastparquet: 2024.5.0
- filelock: 3.15.4
- flake8: 7.0.0
- flatbuffers: 24.3.25
- flax: 0.8.4
- fonttools: 4.53.1
- fqdn: 1.5.1
- frozenlist: 1.4.1
- fsspec: 2024.6.1
- funcsigs: 1.0.2
- gast: 0.6.0
- gin-config: 0.5.0
- google-api-core: 1.34.1
- google-api-python-client: 1.8.0
- google-auth: 2.34.0
- google-auth-httplib2: 0.2.0
- google-pasta: 0.2.0
- googleapis-common-protos: 1.63.2
- grpcio: 1.65.5
- gym: 0.26.2
- gym-notices: 0.0.8
- h5py: 3.11.0
- httplib2: 0.22.0
- huggingface-hub: 0.24.6
- idna: 3.7
- imageio: 2.35.1
- immutabledict: 4.2.0
- importlib-metadata: 8.3.0
- importlib-resources: 6.4.3
- ipykernel: 6.29.5
- ipython: 8.26.0
- ipython-genutils: 0.2.0
- isoduration: 20.11.0
- isort: 5.13.2
- jax: 0.4.23
- jaxlib: 0.4.23
- jedi: 0.19.1
- jinja2: 3.1.4
- jmp: 0.0.4
- joblib: 1.4.2
- jraph: 0.0.6.dev0
- json5: 0.9.25
- jsonpointer: 3.0.0
- jsonschema: 4.23.0
- jsonschema-specifications: 2023.12.1
- jupyter-client: 7.4.9
- jupyter-core: 5.7.2
- jupyter-events: 0.10.0
- jupyter-lsp: 1.5.1
- jupyter-server: 2.14.2
- jupyter-server-fileid: 0.9.2
- jupyter-server-terminals: 0.5.3
- jupyter-server-ydoc: 0.8.0
- jupyter-ydoc: 0.2.5
- jupyterlab: 3.6.7
- jupyterlab-pygments: 0.3.0
- jupyterlab-server: 2.27.3
- kagglehub: 0.2.9
- keras: 3.5.0
- keras-core: 0.1.7
- keras-cv: 0.9.0
- keras-nlp: 0.14.4
- kiwisolver: 1.4.5
- langcodes: 3.4.0
- language-data: 1.2.0
- lazy-loader: 0.4
- libclang: 18.1.1
- librosa: 0.10.2.post1
- libtpu-nightly: 0.1.dev20231213
- lightning-utilities: 0.11.7
- llvmlite: 0.43.0
- marisa-trie: 1.2.0
- markdown: 3.7
- markdown-it-py: 3.0.0
- markupsafe: 2.1.5
- matplotlib: 3.9.2
- matplotlib-inline: 0.1.7
- mccabe: 0.7.0
- mdurl: 0.1.2
- mistune: 3.0.2
- ml-dtypes: 0.3.2
- mpmath: 1.3.0
- msgpack: 1.0.8
- multidict: 6.0.5
- murmurhash: 1.0.10
- namex: 0.0.8
- nbclassic: 1.1.0
- nbclient: 0.10.0
- nbconvert: 7.16.4
- nbformat: 5.10.4
- nest-asyncio: 1.6.0
- networkx: 3.3
- notebook: 6.5.7
- notebook-shim: 0.2.4
- numba: 0.60.0
- numpy: 1.26.4
- nvidia-cublas-cu12: 12.1.3.1
- nvidia-cuda-cupti-cu12: 12.1.105
- nvidia-cuda-nvrtc-cu12: 12.1.105
- nvidia-cuda-runtime-cu12: 12.1.105
- nvidia-cudnn-cu12: 9.1.0.70
- nvidia-cufft-cu12: 11.0.2.54
- nvidia-curand-cu12: 10.3.2.106
- nvidia-cusolver-cu12: 11.4.5.107
- nvidia-cusparse-cu12: 12.1.0.106
- nvidia-nccl-cu12: 2.20.5
- nvidia-nvjitlink-cu12: 12.6.20
- nvidia-nvtx-cu12: 12.1.105
- oauth2client: 4.1.3
- opencv-python: 4.10.0.84
- opencv-python-headless: 4.10.0.84
- opt-einsum: 3.3.0
- optax: 0.2.2
- optree: 0.12.1
- orbax-checkpoint: 0.5.16
- overrides: 7.7.0
- packaging: 24.1
- pandas: 2.2.2
- pandocfilters: 1.5.1
- papermill: 2.6.0
- parso: 0.8.4
- pexpect: 4.9.0
- pillow: 10.4.0
- pip: 23.0.1
- platformdirs: 4.2.2
- pluggy: 1.5.0
- pooch: 1.8.2
- preshed: 3.0.9
- prometheus-client: 0.20.0
- promise: 2.3
- prompt-toolkit: 3.0.47
- protobuf: 3.20.3
- psutil: 6.0.0
- ptyprocess: 0.7.0
- pure-eval: 0.2.3
- pyarrow: 17.0.0
- pyasn1: 0.6.0
- pyasn1-modules: 0.4.0
- pycodestyle: 2.11.1
- pycparser: 2.22
- pydantic: 2.8.2
- pydantic-core: 2.20.1
- pydocstyle: 6.3.0
- pyflakes: 3.2.0
- pygments: 2.18.0
- pylint: 3.2.6
- pyparsing: 3.1.2
- python-dateutil: 2.9.0.post0
- python-json-logger: 2.0.7
- python-lsp-jsonrpc: 1.1.2
- python-lsp-server: 1.11.0
- pytoolconfig: 1.3.1
- pytorch-lightning: 2.4.0
- pytz: 2024.1
- pyyaml: 6.0.2
- pyzmq: 26.1.1
- referencing: 0.35.1
- regex: 2024.7.24
- requests: 2.32.3
- rfc3339-validator: 0.1.4
- rfc3986-validator: 0.1.1
- rich: 13.7.1
- rope: 1.13.0
- rpds-py: 0.20.0
- rsa: 4.9
- safetensors: 0.4.4
- scikit-image: 0.24.0
- scikit-learn: 1.5.1
- scipy: 1.14.0
- seaborn: 0.13.2
- send2trash: 1.8.3
- setuptools: 65.5.1
- shellingham: 1.5.4
- simple-parsing: 0.1.5
- six: 1.16.0
- smart-open: 7.0.4
- sniffio: 1.3.1
- snowballstemmer: 2.2.0
- soundfile: 0.12.1
- soupsieve: 2.6
- soxr: 0.4.0
- spacy: 3.7.6
- spacy-legacy: 3.0.12
- spacy-loggers: 1.0.5
- srsly: 2.4.8
- stack-data: 0.6.3
- sympy: 1.13.2
- tabulate: 0.9.0
- tenacity: 9.0.0
- tensorboard: 2.17.1
- tensorboard-data-server: 0.7.2
- tensorflow-cpu: 2.17.0
- tensorflow-datasets: 4.9.6
- tensorflow-hub: 0.16.1
- tensorflow-io: 0.37.1
- tensorflow-io-gcs-filesystem: 0.37.1
- tensorflow-metadata: 1.15.0
- tensorflow-probability: 0.24.0
- tensorflow-text: 2.16.1
- tensorstore: 0.1.64
- termcolor: 2.4.0
- terminado: 0.18.1
- tf-keras: 2.16.0
- thinc: 8.2.5
- threadpoolctl: 3.5.0
- tifffile: 2024.8.10
- timm: 1.0.8
- tinycss2: 1.3.0
- tokenizers: 0.19.1
- toml: 0.10.2
- tomli: 2.0.1
- tomlkit: 0.13.2
- toolz: 0.12.1
- torch: 2.4.0
- torch-xla: 2.4.0+libtpu
- torchaudio: 2.4.0
- torchmetrics: 1.4.1
- torchvision: 0.19.0
- tornado: 6.4.1
- tqdm: 4.66.5
- traitlets: 5.14.3
- transformers: 4.44.0
- trax: 1.4.1
- triton: 3.0.0
- typer: 0.12.5
- types-python-dateutil: 2.9.0.20240316
- typing-extensions: 4.12.2
- tzdata: 2024.1
- ujson: 5.10.0
- uri-template: 1.3.0
- uritemplate: 3.0.1
- urllib3: 2.2.2
- wasabi: 1.1.3
- wcwidth: 0.2.13
- weasel: 0.4.1
- webcolors: 24.8.0
- webencodings: 0.5.1
- websocket-client: 1.8.0
- werkzeug: 3.0.3
- whatthepatch: 1.0.6
- wheel: 0.44.0
- wrapt: 1.16.0
- y-py: 0.6.2
- yapf: 0.40.2
- yarl: 1.9.7
- ypy-websocket: 0.8.4
- zipp: 3.20.0
System:
- OS: Linux
- architecture:
  - 64bit
  - ELF
- processor:
- python: 3.10.14
- release: 6.1.42+
- version: #1 SMP PREEMPT_DYNAMIC Sun Oct 8 14:23:56 UTC 2023

More info

No response

Sep 03 '24 17:09 Bhargav230m

anyone?

Sep 04 '24 17:09 Bhargav230m

anyone?

This is not a Lightning bug. I had exactly the same error on a Kaggle TPU v3-8 and found the fix in the Kaggle product feedback discussion. Here is the link: https://www.kaggle.com/discussions/product-feedback/473974 tl;dr: remove the offending environment variable with os.environ.pop('TPU_PROCESS_ADDRESSES')

Sep 08 '24 04:09 ibinti

Thanks @ibinti

import os
os.environ.pop('TPU_PROCESS_ADDRESSES')
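
A slightly more defensive variant of the same workaround, as a minimal sketch: a bare `os.environ.pop(name)` raises `KeyError` when the variable is not set (for example, when the notebook cell is re-run), so this version passes a default. The helper name `drop_env_var` is hypothetical and not part of Lightning or torch_xla. It is also an assumption here that the call has to run before the TPU runtime is initialized, i.e. before `trainer.fit(...)`.

```python
import os


def drop_env_var(name, env=os.environ):
    """Remove `name` from `env` if present and return its old value, else None.

    Hypothetical helper for illustration; not part of Lightning or torch_xla.
    """
    # pop() with a default avoids the KeyError that a bare pop(name) raises
    # when the variable is missing, so the cell can be re-run safely.
    return env.pop(name, None)


# Assumption: run this before the TPU runtime starts (before trainer.fit).
previous = drop_env_var("TPU_PROCESS_ADDRESSES")
print("TPU_PROCESS_ADDRESSES was set:", previous is not None)
```

Passing the mapping as a parameter only makes the helper easy to try on a plain dict; in a notebook the default `os.environ` is what matters.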

Jan 09 '25 18:01 steveepreston

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!

Jul 19 '25 05:07 stale[bot]

Closing this issue, as it is not a Lightning problem but is related to Kaggle. Thanks for providing a solution, @ibinti.

Sep 13 '25 11:09 SkafteNicki
