
RuntimeError: Runtime is already initialized. Do not use the XLA device before calling xmp.spawn.

Open pojoba02 opened this issue 6 months ago • 3 comments

🐛 Bug

```
--- Block 13 ALT: Direct xmp.spawn (Consolidated) ---
torch_xla and xmp imported for Block 13.
Defining hyperparameters for training function...
Hyperparameters for training function defined.
Setting XLA/TPU specific environment variables for xmp.spawn...
XRT_TPU_CONFIG already set: localservice;0;localhost:51011
Environment variables set.
Arguments tuple for xmp.spawn's target function prepared.
Set TPU_NUM_DEVICES = 8
Using nprocs = None (None = use all available devices) for xmp.spawn.
```

```
🚀 Launching TPU training directly via xmp.spawn with nprocs=None (auto-detect devices)...
❌❌❌ xmp.spawn FAILED: Runtime ALREADY initialized.
/tmp/ipykernel_10/3843059188.py:91: UserWarning: tpu_cores not found or invalid from Block 0/1. Defaulting to 8 for TPU v3-8.
  warnings.warn("tpu_cores not found or invalid from Block 0/1. Defaulting to 8 for TPU v3-8.")
Traceback (most recent call last):
  File "/tmp/ipykernel_10/3843059188.py", line 103, in <module>
    xmp.spawn(
  File "/usr/local/lib/python3.10/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 39, in spawn
    return pjrt.spawn(fn, nprocs, start_method, args)
  File "/usr/local/lib/python3.10/site-packages/torch_xla/_internal/pjrt.py", line 213, in spawn
    run_multiprocess(spawn_fn, start_method=start_method)
  File "/usr/local/lib/python3.10/site-packages/torch_xla/_internal/pjrt.py", line 145, in run_multiprocess
    raise RuntimeError('Runtime is already initialized. Do not use the XLA '
RuntimeError: Runtime is already initialized. Do not use the XLA device before calling xmp.spawn.
Ensuring WandB run is finished...
Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
✅ Block 13 ALT Completed (Direct xmp.spawn Attempted).
```

To Reproduce

I have been working on this problem for the past two weeks and I can't get my head around it; I really don't know what I'm doing wrong. My question: if you are using a TPU VM v3-8 on Kaggle, does that mean you can't run

```
!pip install "torch~=2.6.0" "torchvision~=0.21.0" "torch_xla[tpu]~=2.6.0" -f https://storage.googleapis.com/libtpu-releases/index.html --quiet
```

in your Kaggle notebook? Is there any special way to install PyTorch/XLA? I initially started with notebook_launcher and Accelerate from Hugging Face.

Steps to reproduce the behavior:
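(The full notebook is not attached; the following is a minimal sketch, reconstructed from the traceback above, of the pattern that triggers the error. `_mp_fn` and the learning-rate argument are hypothetical stand-ins for the notebook's Block 13 code.)

```python
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

# Any XLA device access in the parent process (even an indirect one,
# via another library) initializes the PJRT runtime there...
device = xm.xla_device()  # <-- runtime initialized in the parent

def _mp_fn(index, lr):
    # Per-process training code would go here.
    device = xm.xla_device()

# ...so this spawn call then fails with:
# RuntimeError: Runtime is already initialized. Do not use the XLA
# device before calling xmp.spawn.
xmp.spawn(_mp_fn, args=(1e-4,), nprocs=None)
```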

Expected behavior

Environment

  • Torch: 2.6.0+cu124
  • TorchXLA: 2.6.0+libtpu

Additional context

pojoba02 · Jun 06 '25 01:06

Thank you for your question. Unfortunately, I'm not familiar with Kaggle. @tengyifei @bhavya01 Do you have any thoughts on this?

ysiraichi · Jun 07 '25 15:06

Can you share more details about your notebook? It seems like this error occurs due to accessing some torch_xla code before calling xmp.spawn.
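For reference, a minimal sketch of the pattern that avoids this (names hypothetical): make first contact with the XLA runtime inside the spawned function, never in the notebook's main process.

```python
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    # First contact with the XLA runtime happens here, in the child
    # process spawned by xmp.spawn.
    import torch_xla.core.xla_model as xm
    device = xm.xla_device()
    # ... build the model, move it to `device`, run the training loop ...

# Importing torch_xla in the parent is normally fine; creating tensors
# on, or querying, the XLA device before this call is what raises the
# "Runtime is already initialized" error.
xmp.spawn(_mp_fn, args=())
```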

bhavya01 · Jun 10 '25 22:06

@bhavya01 thanks for your response. It absolutely does. What's fascinating is that whenever I do `from transformers import (...)` or just `import transformers` and later run xmp.spawn or notebook_launcher, I get that issue. This is on Kaggle with a TPU VM v3-8, model = google/gemma-3-4b-it. I think the Kaggle TPU image has PyTorch installed by default:

```python
import torch
import torch_xla
print(f"PyTorch version: {torch.__version__}")
print(f"PyTorch/xla version: {torch_xla.__version__}")
```

```
/usr/local/lib/python3.10/site-packages/torch_xla/__init__.py:251: UserWarning: tensorflow can conflict with torch-xla. Prefer tensorflow-cpu when using PyTorch/XLA. To silence this warning, pip uninstall -y tensorflow && pip install tensorflow-cpu. If you are in a notebook environment such as Colab or Kaggle, restart your notebook runtime afterwards.
  warnings.warn(
PyTorch version: 2.6.0+cu124
PyTorch/xla version: 2.6.0+libtpu
```
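(If a top-level `import transformers` is indeed what initializes the runtime here, which is an assumption based on the observation above, one workaround is to defer that import into the worker function. A sketch with hypothetical names:)

```python
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    # Deferred imports: the parent process never imports transformers
    # or touches the XLA device before xmp.spawn runs.
    import torch_xla.core.xla_model as xm
    from transformers import AutoModelForCausalLM

    device = xm.xla_device()
    # Model id taken from the comment above.
    model = AutoModelForCausalLM.from_pretrained("google/gemma-3-4b-it")
    model = model.to(device)
    # ... training loop ...

xmp.spawn(_mp_fn, args=())
```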

pojoba02 · Jun 10 '25 23:06