pytorch-lightning
TPU Compiler issue with PyTorch 1.11
🐛 Bug
While running the bug_report_model on torch 1.11 with PL, we hit the error below:
Epoch 0: 100%|█████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 19.67it/s, loss=-1.05, v_num=0]/home/kaushikbokka/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:536: PossibleUserWarning: It is recommended to use `self.log('valid_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
warning_cache.warn(
2022-08-09 05:20:06.635795: F ./tensorflow/core/tpu/tpu_executor_init_fns.inc:148] TpuCompiler_DefaultDeviceShapeRepresentation not available in this library.
https://symbolize.stripped_domain/r/?trace=7f7d6063103b,7f7d606310bf,7f7a6166d64c,7f7a61ae72b3,7f7d609f3b89&map=
*** SIGABRT received by PID 13038 (TID 13038) on cpu 74 from PID 13038; stack trace: ***
PC: @ 0x7f7d6063103b (unknown) raise
@ 0x7f7c56dd6cda 992 (unknown)
@ 0x7f7d606310c0 (unknown) (unknown)
@ 0x7f7a6166d64d 416 tensorflow::tpu::(anonymous namespace)::(anonymous namespace)::SetExecutorStructFn()
@ 0x7f7a61ae72b4 544 tensorflow::tpu::(anonymous namespace)::FindAndLoadTpuLibrary()
@ 0x7f7d609f3b8a (unknown) (unknown)
https://symbolize.stripped_domain/r/?trace=7f7d6063103b,7f7c56dd6cd9,7f7d606310bf,7f7a6166d64c,7f7a61ae72b3,7f7d609f3b89&map=50c831e765011c7eb7163b7f3cb5c4b6:7f7c4862c000-7f7c57144f00
To Reproduce
- Spin up a TPU machine with torch 1.11
- Install the latest PL version
- Run the bug_report_model.py script
Expected behavior
The script should run without the TPU compiler error, as it does with torch 1.10.
Environment
- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
- PyTorch Lightning Version (e.g., 1.5.0):
- Lightning App Version (e.g., 0.5.2):
- PyTorch Version (e.g., 1.10):
- Python version (e.g., 3.9):
- OS (e.g., Linux):
- CUDA/cuDNN version:
- GPU models and configuration:
- How you installed PyTorch (conda, pip, source):
- If compiling from source, the output of torch.__config__.show():
- Running environment of LightningApp (e.g. local, cloud):
- Any other relevant information:
Additional context
cc @tchaton @rohitgr7 @kaushikb11
I am experiencing the same issue with our Lightning Trainer, but it fails even before training starts. The same code works with the TPU VM image for PyTorch 1.10.
Hi @jhoareau!
This is likely a libtpu version mismatch. Could you try the steps below?
sudo rm -rf /usr/local/lib/python3.8/dist-packages/libtpu*
sudo pip3 install torch_xla[tpuvm]
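To check which versions ended up installed before and after the steps above, a small script along these lines may help. It is only a sketch: the distribution names in `PACKAGES` (in particular `libtpu-nightly`) are assumptions and may differ on your VM, so compare them with what `pip list` reports.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional

# Assumed distribution names; adjust them to match `pip list` on your TPU VM.
PACKAGES = ("torch", "torch-xla", "libtpu-nightly", "pytorch-lightning")


def installed_versions(names: Iterable[str] = PACKAGES) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed under this name.
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Posting this output along with the traceback makes it easier to see whether the torch, torch_xla, and libtpu builds belong together.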