
Colab notebook link is broken

Open thomasj02 opened this issue 1 year ago • 10 comments

📚 Documentation

https://pytorch.org/xla/release/2.2/index.html#performance-profiling has a link to a Colab notebook that is broken (https://colab.research.google.com/github/pytorch/xla/blob/master/contrib/colab/pytorch-xla-profiling-colab.ipynb)

It looks like the contrib/colab directory doesn't even exist anymore.

I'm having a tough time getting PJRT running in colab, so an example of how to do this would be really helpful.

thomasj02 avatar Feb 17 '24 23:02 thomasj02

Oh, our bad. Colab is on the old TPU Node architecture and does not support any release newer than PyTorch 2.0. Can you try Kaggle? @zpcore can you remove the outdated link from our docs?

JackCaoG avatar Feb 20 '24 18:02 JackCaoG

You can find the Kaggle example at https://github.com/pytorch/xla/tree/master/contrib/kaggle
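
Regarding the PJRT question above, here's a minimal smoke test you could run in one of those Kaggle notebooks (a sketch only, assuming a TPU VM with a recent torch_xla 2.x; PJRT_DEVICE and xm.xla_device() are the documented entry points, not a tested recipe for this exact image):

import os
os.environ.setdefault("PJRT_DEVICE", "TPU")  # tell torch_xla to use the PJRT TPU runtime

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()              # first XLA device (a TPU core under PJRT)
x = torch.randn(2, 2, device=device)
print(device, (x @ x).cpu())          # forces execution and copies the result back to host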

JackCaoG avatar Feb 20 '24 18:02 JackCaoG

Thanks, I'll check out Kaggle

thomasj02 avatar Feb 20 '24 20:02 thomasj02

It seems the Kaggle notebook environment is also somewhat broken. I have no idea whether this is an XLA issue, a JAX issue, or something else, but here's the error: https://www.kaggle.com/discussions/product-feedback/479523

thomasj02 avatar Feb 25 '24 00:02 thomasj02

Hmm, @will-cromar have you ever run into this?

2024-02-24 22:48:13.356169: F external/local_xla/xla/stream_executor/tpu/tpu_library_init_fns.inc:85] TpuUtil_GetXlaPadSizeFromTpuTopology not available in this library.

JackCaoG avatar Feb 26 '24 18:02 JackCaoG

xla/stream_executor/tpu/tpu_library_init_fns.inc looks like a very outdated libtpu to me. We dropped support for StreamExecutor on PJRT about a year ago IIRC. Is this on the current Kaggle TPU VM environment?

will-cromar avatar Feb 26 '24 18:02 will-cromar

Yes, it's the TPU VM environment that Kaggle calls "TPU VM v3-8"

Here's an example notebook: https://www.kaggle.com/code/tjohnson/notebookbf52281afd

thomasj02 avatar Feb 26 '24 19:02 thomasj02

It looks like there are two issues in your example notebook. First, the torch version isn't updated to 2.2 yet. I just sent a PR to do this: https://github.com/Kaggle/docker-python/pull/1364
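
As a quick way to confirm what the image currently ships (a trivial check via pip metadata only, so it doesn't import torch_xla at all):

!pip show torch torch_xla | grep -E "^(Name|Version):"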

Kaggle requires a special build of torch_xla that has libtpu bundled. Otherwise, it conflicts with the libtpu versions installed by JAX and/or TF. These builds are marked with +libtpu in our release bucket, e.g. https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-2.2.0%2Blibtpu-cp310-cp310-manylinux_2_28_x86_64.whl. The normal install path will not work on Kaggle.
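
To make that concrete, a rough install sketch for a Kaggle notebook cell (assuming Python 3.10 and the 2.2.0 release; the wheel URL is just the one above, so adjust it for other Python or torch_xla versions):

!pip uninstall --yes torch_xla
# torch must match the torch_xla release; both version pins here are assumptions for 2.2.0
!pip install torch==2.2.0 https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-2.2.0%2Blibtpu-cp310-cp310-manylinux_2_28_x86_64.whl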

For importing transformers, this is a known issue: https://github.com/pytorch/xla/issues/5625#issuecomment-1743493309

The workaround is to replace the TF TPU package:

# Remove the TPU-enabled TensorFlow build, which pulls in a conflicting libtpu
!pip uninstall --yes tensorflow
# Install the CPU-only build in its place
!pip install tensorflow-cpu

will-cromar avatar Feb 26 '24 21:02 will-cromar

Thanks so much for looking into this @will-cromar !

I'll switch the tensorflow versions as recommended.

I noticed that the docker-python PR is failing CI, although I don't have permissions to see the details: https://github.com/Kaggle/docker-python/commits/main/

thomasj02 avatar Feb 26 '24 21:02 thomasj02

Thanks for the heads up. I'll work with the Kaggle team to get that image updated.

will-cromar avatar Feb 26 '24 23:02 will-cromar