RuntimeError: Bad StatusOr access: UNKNOWN: TPU initialization failed: Invalid --2a886c8_slice_builder_worker_addresses specified. Expected 4 worker addresses, got 1.

Open Bhargav230m opened this issue 1 year ago • 3 comments

Bug description

I am trying to use a TPU on Kaggle and get the error "RuntimeError: Bad StatusOr access: UNKNOWN: TPU initialization failed: Invalid --2a886c8_slice_builder_worker_addresses specified. Expected 4 worker addresses, got 1."

I am using 8 TPU cores. Here is my Trainer:

import pytorch_lightning as pl
from pytorch_lightning import Trainer

trainer = Trainer(
    max_epochs=50,
    accelerator="tpu",
    devices=8,
    callbacks=[pl.callbacks.EarlyStopping(monitor='val_loss', patience=2)]
)

I am new to machine learning, so please tell me if I have made a mistake.

What version are you seeing the problem on?

v2.4

How to reproduce the bug

No response

Error messages and logs

WARNING: Logging before InitGoogle() is written to STDERR
E0000 00:00:1725383433.302361    2870 common_lib.cc:818] Could not set metric server port: INVALID_ARGUMENT: Could not find SliceBuilder port 8476 in any of the 0 ports provided in `tpu_process_addresses`="local"
=== Source Location Trace: ===
learning/45eac/tfrc/runtime/common_lib.cc:483
WARNING: Logging before InitGoogle() is written to STDERR
E0000 00:00:1725383433.407367    2874 common_lib.cc:818] Could not set metric server port: INVALID_ARGUMENT: Could not find SliceBuilder port 8477 in any of the 0 ports provided in `tpu_process_addresses`="local"
=== Source Location Trace: === 
learning/45eac/tfrc/runtime/common_lib.cc:483
WARNING: Logging before InitGoogle() is written to STDERR
E0000 00:00:1725383433.442340    2878 common_lib.cc:818] Could not set metric server port: INVALID_ARGUMENT: Could not find SliceBuilder port 8478 in any of the 0 ports provided in `tpu_process_addresses`="local"
=== Source Location Trace: ===
learning/45eac/tfrc/runtime/common_lib.cc:483
WARNING: Logging before InitGoogle() is written to STDERR
E0000 00:00:1725383433.453311    2882 common_lib.cc:818] Could not set metric server port: INVALID_ARGUMENT: Could not find SliceBuilder port 8479 in any of the 0 ports provided in `tpu_process_addresses`="local"
=== Source Location Trace: === 
learning/45eac/tfrc/runtime/common_lib.cc:483
---------------------------------------------------------------------------
_RemoteTraceback                          Traceback (most recent call last)
_RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 205, in _process_chunk
    return [fn(*args) for args in chunk]
  File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 205, in <listcomp>
    return [fn(*args) for args in chunk]
  File "/usr/local/lib/python3.10/site-packages/torch_xla/runtime.py", line 95, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch_xla/_internal/pjrt.py", line 59, in _run_thread_per_device
    initializer_fn(local_rank, local_world_size)
  File "/usr/local/lib/python3.10/site-packages/torch_xla/runtime.py", line 95, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch_xla/_internal/pjrt.py", line 125, in initialize_multiprocess
    devices = xm.get_xla_supported_devices()
  File "/usr/local/lib/python3.10/site-packages/torch_xla/core/xla_model.py", line 99, in get_xla_supported_devices
    devices = torch_xla._XLAC._xla_get_devices()
RuntimeError: Bad StatusOr access: UNKNOWN: TPU initialization failed: Invalid --2a886c8_slice_builder_worker_addresses specified. Expected 4 worker addresses, got 1.
"""

The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)
Cell In[47], line 12
      1 model = ToxicCommentModel(
      2     input_size=hyperparameters["input_size"], 
      3     hidden_size=hyperparameters["linear_hidden_size"],  
   (...)
     10     max_len=hyperparameters["context_length"]
     11 )
---> 12 trainer.fit(model, data_module)

File /usr/local/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py:538, in Trainer.fit(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
    536 self.state.status = TrainerStatus.RUNNING
    537 self.training = True
--> 538 call._call_and_handle_interrupt(
    539     self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
    540 )

File /usr/local/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py:46, in _call_and_handle_interrupt(trainer, trainer_fn, *args, **kwargs)
     44 try:
     45     if trainer.strategy.launcher is not None:
---> 46         return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
     47     return trainer_fn(*args, **kwargs)
     49 except _TunerExitException:

File /usr/local/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/xla.py:98, in _XLALauncher.launch(self, function, trainer, *args, **kwargs)
     93 if nprocs == 1:
     94     # avoid warning: "Unsupported nprocs". If it's 1, it will call the launched function directly.
     95     # otherwise it will use all devices
     96     spawn_kwargs["nprocs"] = nprocs
---> 98 process_context = xmp.spawn(
     99     self._wrapping_function,
    100     args=(trainer, function, args, kwargs, return_queue),
    101     start_method=self._start_method,
    102     join=False,  # we will join ourselves to get the process references
    103     **spawn_kwargs,
    104 )
    105 # xla will not actually create processes if only 1 device
    106 if process_context is not None:

File /usr/local/lib/python3.10/site-packages/torch_xla/runtime.py:95, in requires_pjrt.<locals>.wrapper(*args, **kwargs)
     91 if not using_pjrt():
     92   raise NotImplementedError('`{}` not implemented for XRT'.format(
     93       fn.__name__))
---> 95 return fn(*args, **kwargs)

File /usr/local/lib/python3.10/site-packages/torch_xla/distributed/xla_multiprocessing.py:38, in spawn(fn, args, nprocs, join, daemon, start_method)
      6 @xr.requires_pjrt
      7 def spawn(fn,
      8           args=(),
   (...)
     11           daemon=False,
     12           start_method='spawn'):
     13   """Enables multi processing based replication.
     14 
     15   Args:
   (...)
     36     return None.
     37   """
---> 38   return pjrt.spawn(fn, nprocs, start_method, args)

File /usr/local/lib/python3.10/site-packages/torch_xla/_internal/pjrt.py:214, in spawn(fn, nprocs, start_method, args)
    211 elif nprocs is not None:
    212   logging.warning('Unsupported nprocs (%d), ignoring...' % nprocs)
--> 214 run_multiprocess(spawn_fn, start_method=start_method)

File /usr/local/lib/python3.10/site-packages/torch_xla/runtime.py:95, in requires_pjrt.<locals>.wrapper(*args, **kwargs)
     91 if not using_pjrt():
     92   raise NotImplementedError('`{}` not implemented for XRT'.format(
     93       fn.__name__))
---> 95 return fn(*args, **kwargs)

File /usr/local/lib/python3.10/site-packages/torch_xla/_internal/pjrt.py:174, in run_multiprocess(fn, start_method, *args, **kwargs)
    168   mp_fn = functools.partial(
    169       _run_thread_per_device,
    170       local_world_size=num_processes,
    171       fn=functools.partial(fn, *args, **kwargs),
    172       initializer_fn=initialize_multiprocess)
    173   process_results = executor.map(mp_fn, range(num_processes))
--> 174   replica_results = list(
    175       itertools.chain.from_iterable(
    176           result.items() for result in process_results))
    178 return _merge_replica_results(replica_results)

File /usr/local/lib/python3.10/site-packages/torch_xla/_internal/pjrt.py:175, in <genexpr>(.0)
    168   mp_fn = functools.partial(
    169       _run_thread_per_device,
    170       local_world_size=num_processes,
    171       fn=functools.partial(fn, *args, **kwargs),
    172       initializer_fn=initialize_multiprocess)
    173   process_results = executor.map(mp_fn, range(num_processes))
    174   replica_results = list(
--> 175       itertools.chain.from_iterable(
    176           result.items() for result in process_results))
    178 return _merge_replica_results(replica_results)

File /usr/local/lib/python3.10/concurrent/futures/process.py:575, in _chain_from_iterable_of_lists(iterable)
    569 def _chain_from_iterable_of_lists(iterable):
    570     """
    571     Specialized implementation of itertools.chain.from_iterable.
    572     Each item in *iterable* should be a list.  This function is
    573     careful not to keep references to yielded objects.
    574     """
--> 575     for element in iterable:
    576         element.reverse()
    577         while element:

File /usr/local/lib/python3.10/concurrent/futures/_base.py:621, in Executor.map.<locals>.result_iterator()
    618 while fs:
    619     # Careful not to keep a reference to the popped future
    620     if timeout is None:
--> 621         yield _result_or_cancel(fs.pop())
    622     else:
    623         yield _result_or_cancel(fs.pop(), end_time - time.monotonic())

File /usr/local/lib/python3.10/concurrent/futures/_base.py:319, in _result_or_cancel(***failed resolving arguments***)
    317 try:
    318     try:
--> 319         return fut.result(timeout)
    320     finally:
    321         fut.cancel()

File /usr/local/lib/python3.10/concurrent/futures/_base.py:458, in Future.result(self, timeout)
    456     raise CancelledError()
    457 elif self._state == FINISHED:
--> 458     return self.__get_result()
    459 else:
    460     raise TimeoutError()

File /usr/local/lib/python3.10/concurrent/futures/_base.py:403, in Future.__get_result(self)
    401 if self._exception:
    402     try:
--> 403         raise self._exception
    404     finally:
    405         # Break a reference cycle with the exception in self._exception
    406         self = None

RuntimeError: Bad StatusOr access: UNKNOWN: TPU initialization failed: Invalid --2a886c8_slice_builder_worker_addresses specified. Expected 4 worker addresses, got 1.
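
The log lines above report `tpu_process_addresses`="local" while the runtime expects 4 worker addresses, so one way to narrow this down is to look at the TPU-related environment variables set in the notebook. A minimal sketch, assuming the relevant variables share a TPU_ prefix; the helper name tpu_env_vars is hypothetical and not part of Lightning or torch_xla:

```python
import os
from typing import Mapping, Optional


def tpu_env_vars(env: Optional[Mapping[str, str]] = None, prefix: str = "TPU_") -> dict:
    """Return the variables whose names start with `prefix`, sorted by name.

    The TPU_ prefix is an assumption; pass a different prefix if needed.
    """
    source = os.environ if env is None else env
    return {name: source[name] for name in sorted(source) if name.startswith(prefix)}


if __name__ == "__main__":
    # Print every matching variable so unexpected values stand out.
    for name, value in tpu_env_vars().items():
        print(f"{name}={value}")
```

Running this in a cell before creating the Trainer shows which values the TPU runtime is likely to pick up from the environment.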

Environment

Current environment
  • CUDA:
    • GPU: None
    • available: False
    • version: 12.1
  • Lightning:
    • lightning-utilities: 0.11.7
    • pytorch-lightning: 2.4.0
    • torch: 2.4.0
    • torch-xla: 2.4.0+libtpu
    • torchaudio: 2.4.0
    • torchmetrics: 1.4.1
    • torchvision: 0.19.0
  • Packages:
    • absl-py: 2.1.0
    • accelerate: 0.33.0
    • aiofiles: 22.1.0
    • aiohappyeyeballs: 2.4.0
    • aiohttp: 3.10.5
    • aiosignal: 1.3.1
    • aiosqlite: 0.20.0
    • albucore: 0.0.13
    • albumentations: 1.4.14
    • annotated-types: 0.7.0
    • ansicolors: 1.1.8
    • anyio: 4.4.0
    • argon2-cffi: 23.1.0
    • argon2-cffi-bindings: 21.2.0
    • array-record: 0.5.1
    • arrow: 1.3.0
    • astroid: 3.2.4
    • asttokens: 2.4.1
    • astunparse: 1.6.3
    • async-timeout: 4.0.3
    • attrs: 24.2.0
    • audioread: 3.0.1
    • autopep8: 2.0.4
    • babel: 2.16.0
    • beautifulsoup4: 4.12.3
    • bleach: 6.1.0
    • blis: 0.7.11
    • cachetools: 5.5.0
    • catalogue: 2.0.10
    • certifi: 2024.7.4
    • cffi: 1.17.0
    • charset-normalizer: 3.3.2
    • chex: 0.1.86
    • click: 8.1.7
    • cloud-tpu-client: 0.10
    • cloudpathlib: 0.19.0
    • cloudpickle: 3.0.0
    • comm: 0.2.2
    • confection: 0.1.5
    • contourpy: 1.2.1
    • cramjam: 2.8.3
    • cycler: 0.12.1
    • cymem: 2.0.8
    • debugpy: 1.8.5
    • decorator: 5.1.1
    • defusedxml: 0.7.1
    • diffusers: 0.30.0
    • dill: 0.3.8
    • distrax: 0.1.5
    • dm-haiku: 0.0.13.dev0
    • dm-tree: 0.1.8
    • docstring-parser: 0.16
    • docstring-to-markdown: 0.15
    • einops: 0.8.0
    • en-core-web-sm: 3.7.1
    • entrypoints: 0.4
    • etils: 1.7.0
    • eval-type-backport: 0.2.0
    • exceptiongroup: 1.2.2
    • executing: 2.0.1
    • fastjsonschema: 2.20.0
    • fastparquet: 2024.5.0
    • filelock: 3.15.4
    • flake8: 7.0.0
    • flatbuffers: 24.3.25
    • flax: 0.8.4
    • fonttools: 4.53.1
    • fqdn: 1.5.1
    • frozenlist: 1.4.1
    • fsspec: 2024.6.1
    • funcsigs: 1.0.2
    • gast: 0.6.0
    • gin-config: 0.5.0
    • google-api-core: 1.34.1
    • google-api-python-client: 1.8.0
    • google-auth: 2.34.0
    • google-auth-httplib2: 0.2.0
    • google-pasta: 0.2.0
    • googleapis-common-protos: 1.63.2
    • grpcio: 1.65.5
    • gym: 0.26.2
    • gym-notices: 0.0.8
    • h5py: 3.11.0
    • httplib2: 0.22.0
    • huggingface-hub: 0.24.6
    • idna: 3.7
    • imageio: 2.35.1
    • immutabledict: 4.2.0
    • importlib-metadata: 8.3.0
    • importlib-resources: 6.4.3
    • ipykernel: 6.29.5
    • ipython: 8.26.0
    • ipython-genutils: 0.2.0
    • isoduration: 20.11.0
    • isort: 5.13.2
    • jax: 0.4.23
    • jaxlib: 0.4.23
    • jedi: 0.19.1
    • jinja2: 3.1.4
    • jmp: 0.0.4
    • joblib: 1.4.2
    • jraph: 0.0.6.dev0
    • json5: 0.9.25
    • jsonpointer: 3.0.0
    • jsonschema: 4.23.0
    • jsonschema-specifications: 2023.12.1
    • jupyter-client: 7.4.9
    • jupyter-core: 5.7.2
    • jupyter-events: 0.10.0
    • jupyter-lsp: 1.5.1
    • jupyter-server: 2.14.2
    • jupyter-server-fileid: 0.9.2
    • jupyter-server-terminals: 0.5.3
    • jupyter-server-ydoc: 0.8.0
    • jupyter-ydoc: 0.2.5
    • jupyterlab: 3.6.7
    • jupyterlab-pygments: 0.3.0
    • jupyterlab-server: 2.27.3
    • kagglehub: 0.2.9
    • keras: 3.5.0
    • keras-core: 0.1.7
    • keras-cv: 0.9.0
    • keras-nlp: 0.14.4
    • kiwisolver: 1.4.5
    • langcodes: 3.4.0
    • language-data: 1.2.0
    • lazy-loader: 0.4
    • libclang: 18.1.1
    • librosa: 0.10.2.post1
    • libtpu-nightly: 0.1.dev20231213
    • lightning-utilities: 0.11.7
    • llvmlite: 0.43.0
    • marisa-trie: 1.2.0
    • markdown: 3.7
    • markdown-it-py: 3.0.0
    • markupsafe: 2.1.5
    • matplotlib: 3.9.2
    • matplotlib-inline: 0.1.7
    • mccabe: 0.7.0
    • mdurl: 0.1.2
    • mistune: 3.0.2
    • ml-dtypes: 0.3.2
    • mpmath: 1.3.0
    • msgpack: 1.0.8
    • multidict: 6.0.5
    • murmurhash: 1.0.10
    • namex: 0.0.8
    • nbclassic: 1.1.0
    • nbclient: 0.10.0
    • nbconvert: 7.16.4
    • nbformat: 5.10.4
    • nest-asyncio: 1.6.0
    • networkx: 3.3
    • notebook: 6.5.7
    • notebook-shim: 0.2.4
    • numba: 0.60.0
    • numpy: 1.26.4
    • nvidia-cublas-cu12: 12.1.3.1
    • nvidia-cuda-cupti-cu12: 12.1.105
    • nvidia-cuda-nvrtc-cu12: 12.1.105
    • nvidia-cuda-runtime-cu12: 12.1.105
    • nvidia-cudnn-cu12: 9.1.0.70
    • nvidia-cufft-cu12: 11.0.2.54
    • nvidia-curand-cu12: 10.3.2.106
    • nvidia-cusolver-cu12: 11.4.5.107
    • nvidia-cusparse-cu12: 12.1.0.106
    • nvidia-nccl-cu12: 2.20.5
    • nvidia-nvjitlink-cu12: 12.6.20
    • nvidia-nvtx-cu12: 12.1.105
    • oauth2client: 4.1.3
    • opencv-python: 4.10.0.84
    • opencv-python-headless: 4.10.0.84
    • opt-einsum: 3.3.0
    • optax: 0.2.2
    • optree: 0.12.1
    • orbax-checkpoint: 0.5.16
    • overrides: 7.7.0
    • packaging: 24.1
    • pandas: 2.2.2
    • pandocfilters: 1.5.1
    • papermill: 2.6.0
    • parso: 0.8.4
    • pexpect: 4.9.0
    • pillow: 10.4.0
    • pip: 23.0.1
    • platformdirs: 4.2.2
    • pluggy: 1.5.0
    • pooch: 1.8.2
    • preshed: 3.0.9
    • prometheus-client: 0.20.0
    • promise: 2.3
    • prompt-toolkit: 3.0.47
    • protobuf: 3.20.3
    • psutil: 6.0.0
    • ptyprocess: 0.7.0
    • pure-eval: 0.2.3
    • pyarrow: 17.0.0
    • pyasn1: 0.6.0
    • pyasn1-modules: 0.4.0
    • pycodestyle: 2.11.1
    • pycparser: 2.22
    • pydantic: 2.8.2
    • pydantic-core: 2.20.1
    • pydocstyle: 6.3.0
    • pyflakes: 3.2.0
    • pygments: 2.18.0
    • pylint: 3.2.6
    • pyparsing: 3.1.2
    • python-dateutil: 2.9.0.post0
    • python-json-logger: 2.0.7
    • python-lsp-jsonrpc: 1.1.2
    • python-lsp-server: 1.11.0
    • pytoolconfig: 1.3.1
    • pytorch-lightning: 2.4.0
    • pytz: 2024.1
    • pyyaml: 6.0.2
    • pyzmq: 26.1.1
    • referencing: 0.35.1
    • regex: 2024.7.24
    • requests: 2.32.3
    • rfc3339-validator: 0.1.4
    • rfc3986-validator: 0.1.1
    • rich: 13.7.1
    • rope: 1.13.0
    • rpds-py: 0.20.0
    • rsa: 4.9
    • safetensors: 0.4.4
    • scikit-image: 0.24.0
    • scikit-learn: 1.5.1
    • scipy: 1.14.0
    • seaborn: 0.13.2
    • send2trash: 1.8.3
    • setuptools: 65.5.1
    • shellingham: 1.5.4
    • simple-parsing: 0.1.5
    • six: 1.16.0
    • smart-open: 7.0.4
    • sniffio: 1.3.1
    • snowballstemmer: 2.2.0
    • soundfile: 0.12.1
    • soupsieve: 2.6
    • soxr: 0.4.0
    • spacy: 3.7.6
    • spacy-legacy: 3.0.12
    • spacy-loggers: 1.0.5
    • srsly: 2.4.8
    • stack-data: 0.6.3
    • sympy: 1.13.2
    • tabulate: 0.9.0
    • tenacity: 9.0.0
    • tensorboard: 2.17.1
    • tensorboard-data-server: 0.7.2
    • tensorflow-cpu: 2.17.0
    • tensorflow-datasets: 4.9.6
    • tensorflow-hub: 0.16.1
    • tensorflow-io: 0.37.1
    • tensorflow-io-gcs-filesystem: 0.37.1
    • tensorflow-metadata: 1.15.0
    • tensorflow-probability: 0.24.0
    • tensorflow-text: 2.16.1
    • tensorstore: 0.1.64
    • termcolor: 2.4.0
    • terminado: 0.18.1
    • tf-keras: 2.16.0
    • thinc: 8.2.5
    • threadpoolctl: 3.5.0
    • tifffile: 2024.8.10
    • timm: 1.0.8
    • tinycss2: 1.3.0
    • tokenizers: 0.19.1
    • toml: 0.10.2
    • tomli: 2.0.1
    • tomlkit: 0.13.2
    • toolz: 0.12.1
    • torch: 2.4.0
    • torch-xla: 2.4.0+libtpu
    • torchaudio: 2.4.0
    • torchmetrics: 1.4.1
    • torchvision: 0.19.0
    • tornado: 6.4.1
    • tqdm: 4.66.5
    • traitlets: 5.14.3
    • transformers: 4.44.0
    • trax: 1.4.1
    • triton: 3.0.0
    • typer: 0.12.5
    • types-python-dateutil: 2.9.0.20240316
    • typing-extensions: 4.12.2
    • tzdata: 2024.1
    • ujson: 5.10.0
    • uri-template: 1.3.0
    • uritemplate: 3.0.1
    • urllib3: 2.2.2
    • wasabi: 1.1.3
    • wcwidth: 0.2.13
    • weasel: 0.4.1
    • webcolors: 24.8.0
    • webencodings: 0.5.1
    • websocket-client: 1.8.0
    • werkzeug: 3.0.3
    • whatthepatch: 1.0.6
    • wheel: 0.44.0
    • wrapt: 1.16.0
    • y-py: 0.6.2
    • yapf: 0.40.2
    • yarl: 1.9.7
    • ypy-websocket: 0.8.4
    • zipp: 3.20.0
  • System:
    • OS: Linux
    • architecture:
      • 64bit
      • ELF
    • processor:
    • python: 3.10.14
    • release: 6.1.42+
    • version: #1 SMP PREEMPT_DYNAMIC Sun Oct 8 14:23:56 UTC 2023

More info

No response

Bhargav230m • Sep 03 '24 17:09

anyone?

Bhargav230m • Sep 04 '24 17:09

anyone?

This is not a Lightning bug. I had exactly the same error on a Kaggle TPU v3-8 and found the fix in the Kaggle product feedback discussion: https://www.kaggle.com/discussions/product-feedback/473974 tl;dr: remove the offending environment variable with os.environ.pop('TPU_PROCESS_ADDRESSES')

ibinti • Sep 08 '24 04:09

Thanks @ibinti

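import os
# Remove the variable before creating the Trainer; the None default avoids a KeyError if the cell is re-run.
os.environ.pop('TPU_PROCESS_ADDRESSES', None)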
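
For context, here is a minimal sketch of where the workaround could sit in a notebook: the variable is removed first, before the Trainer is built and before trainer.fit launches the XLA processes. The helper name clear_tpu_process_addresses is hypothetical. That the removal has to happen before TPU initialization is an assumption based on the traceback, where the failure occurs while devices are being initialized inside trainer.fit.

```python
import os
from typing import MutableMapping, Optional


def clear_tpu_process_addresses(env: Optional[MutableMapping[str, str]] = None) -> Optional[str]:
    """Remove TPU_PROCESS_ADDRESSES from `env` (default: os.environ).

    Returns the previous value, or None if the variable was not set,
    so the call is safe to repeat when a notebook cell is re-run.
    """
    target = os.environ if env is None else env
    return target.pop("TPU_PROCESS_ADDRESSES", None)


# Run this before building the Trainer shown in the bug description.
previous = clear_tpu_process_addresses()
print("removed TPU_PROCESS_ADDRESSES:", previous)
```

Returning the old value makes it easy to confirm that the variable was actually present in the session.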

steveepreston • Jan 09 '25 18:01

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!

stale[bot] • Jul 19 '25 05:07

Closing this issue, as it is not a Lightning problem but is related to Kaggle. Thanks for providing a solution, @ibinti.

SkafteNicki • Sep 13 '25 11:09