pytorch-lightning TPU Compiler issue with PyTorch 1.11

🐛 Bug

Running the bug_report_model script on torch 1.11 with PL fails with the error below:

Epoch 0: 100%|█████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 19.67it/s, loss=-1.05, v_num=0]/home/kaushikbokka/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:536: PossibleUserWarning: It is recommended to use `self.log('valid_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
  warning_cache.warn(
2022-08-09 05:20:06.635795: F ./tensorflow/core/tpu/tpu_executor_init_fns.inc:148] TpuCompiler_DefaultDeviceShapeRepresentation not available in this library.
https://symbolize.stripped_domain/r/?trace=7f7d6063103b,7f7d606310bf,7f7a6166d64c,7f7a61ae72b3,7f7d609f3b89&map= 
*** SIGABRT received by PID 13038 (TID 13038) on cpu 74 from PID 13038; stack trace: ***
PC: @     0x7f7d6063103b  (unknown)  raise
    @     0x7f7c56dd6cda        992  (unknown)
    @     0x7f7d606310c0  (unknown)  (unknown)
    @     0x7f7a6166d64d        416  tensorflow::tpu::(anonymous namespace)::(anonymous namespace)::SetExecutorStructFn()
    @     0x7f7a61ae72b4        544  tensorflow::tpu::(anonymous namespace)::FindAndLoadTpuLibrary()
    @     0x7f7d609f3b8a  (unknown)  (unknown)
https://symbolize.stripped_domain/r/?trace=7f7d6063103b,7f7c56dd6cd9,7f7d606310bf,7f7a6166d64c,7f7a61ae72b3,7f7d609f3b89&map=50c831e765011c7eb7163b7f3cb5c4b6:7f7c4862c000-7f7c57144f00

To Reproduce

Spin up a TPU machine with torch 1.11
Install the latest PL version
Run the bug_report_model.py script

Expected behavior

The script should run without the compiler error (it works with torch 1.10).

Environment

Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
PyTorch Lightning Version (e.g., 1.5.0):
Lightning App Version (e.g., 0.5.2):
PyTorch Version (e.g., 1.10):
Python version (e.g., 3.9):
OS (e.g., Linux):
CUDA/cuDNN version:
GPU models and configuration:
How you installed PyTorch (conda, pip, source):
If compiling from source, the output of torch.__config__.show():
Running environment of LightningApp (e.g. local, cloud):
Any other relevant information:

Additional context

cc @tchaton @rohitgr7 @kaushikb11

Aug 09 '22 05:08 kaushikb11

I am experiencing the same issue with our Lightning Trainer, but it fails before training even starts. It also works with the TPU VM image for PyTorch 1.10.

Aug 31 '22 11:08 jhoareau

Hi @jhoareau!

This is likely a libtpu version mismatch. Could you try the steps below?

sudo rm -rf /usr/local/lib/python3.8/dist-packages/libtpu*
sudo pip3 install torch_xla[tpuvm]
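
To confirm which versions are actually installed before and after the steps above, a small check along these lines may help. This is only a sketch: the distribution names in `PACKAGES` (in particular `torch-xla` and `libtpu-nightly`) are assumptions about how the packages are registered with pip and may need adjusting for your image, and `installed_versions` is a hypothetical helper, not part of PL or torch_xla.

```python
from importlib import metadata

# Assumed pip distribution names; adjust them to match your environment.
PACKAGES = ("torch", "torch-xla", "libtpu-nightly", "pytorch-lightning")


def installed_versions(names=PACKAGES):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed under this name.
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

If the reported libtpu version does not correspond to the installed torch and torch_xla release, that would be consistent with the mismatch described above.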

Sep 05 '22 09:09 kaushikb11
