pawkanarek

Results 31 comments of pawkanarek

The script is working. It looks like I was using the wrong VM version when creating the TPU, and I forgot to set the environment variables. The correct way to create a TPU v4-8: ``` gcloud...

I had the same problem as @windmaple: ```txt AttributeError: module 'torch_xla.distributed.spmd' has no attribute 'set_global_mesh' ``` As @alanwaketan suggested, I installed a nightly build of XLA in a fresh conda env with...

To resolve this problem: ``` ImportError: /home/me/miniconda3/envs/v_xla/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c104impl3cow23materialize_cow_storageERNS_11StorageImplE ``` you have to update PyTorch to the nightly build: ```bash conda install pytorch-nightly::pytorch ``` But after this I got a new problem...

@alanwaketan I think my libtpu version is `tpu-vm-pt-2.0`; this is based on the command I used to create my TPU v4-8. ``` gcloud compute tpus tpu-vm create my-tpu-name...

I've installed this package: ``` libtpu-nightly 0.1.dev20240213 ``` and I still get the same error: ``` File "/home/me/miniconda3/envs/v_xla/lib/python3.10/site-packages/torch_xla/runtime.py", line 124, in xla_device return torch.device(torch_xla._XLAC._xla_get_default_device()) RuntimeError: Bad StatusOr access: INTERNAL: Failed to...

I created a new machine with the command ``` gcloud compute tpus tpu-vm create my-name --zone=us-central2-b --accelerator-type=v4-8 --version=tpu-ubuntu2204-base ``` installed all the required packages on it, and now when I try to run this...

This might be irrelevant, but I managed to read the core dump file with `gdb`. Sadly, I cannot find any specific errors. This is what `gdb` shows me: - `bt`:...

@JackCaoG I have now created the v4-8 machine with this VM version: `tpu-vm-v4-pt-2.0` ``` gcloud compute tpus tpu-vm create myname --zone=us-central2-b --accelerator-type=v4-8 --version=tpu-vm-v4-pt-2.0 ``` Now I'm getting a different message, but...

Thanks for the advice. The sanity check looks good on this TPU. Imports: ``` python Python 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] on linux Type "help", "copyright", "credits" or "license"...

It seems that setting `export PJRT_DEVICE=TPU` and `export XLA_USE_SPMD=1` resolved the issue. I was certain I had exported the variables... Training now works, though it occasionally crashes during training...
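Because the missing exports went unnoticed for a while, a small pre-flight check run before the training script could catch the same mistake earlier. The following is a minimal sketch only: the helper name `find_env_problems` and the `REQUIRED_ENV` table are hypothetical and not part of torch_xla. The two variable names and values are the ones quoted in the comment above.

```python
import os

# Variables and values taken from the comment above; adjust for your setup.
REQUIRED_ENV = {"PJRT_DEVICE": "TPU", "XLA_USE_SPMD": "1"}


def find_env_problems(environ, required=REQUIRED_ENV):
    """Return a list of human-readable problems; an empty list means all is set.

    Hypothetical helper for illustration, not a torch_xla API.
    """
    problems = []
    for name, expected in required.items():
        actual = environ.get(name)
        if actual is None:
            problems.append(f"{name} is not set (expected {expected!r})")
        elif actual != expected:
            problems.append(f"{name}={actual!r} (expected {expected!r})")
    return problems


if __name__ == "__main__":
    # Print a warning for each variable that is missing or has another value.
    for problem in find_env_problems(os.environ):
        print("WARNING:", problem)
```

Running it in the same shell that will launch training shows whether the exports actually reached that shell's environment.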