TPU Compiler issue with PyTorch 1.11

Open kaushikb11 opened this issue 3 years ago • 0 comments

🐛 Bug

While running the bug_report_model on torch 1.11 with PL, we hit the error below:

Epoch 0: 100%|█████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 19.67it/s, loss=-1.05, v_num=0]/home/kaushikbokka/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:536: PossibleUserWarning: It is recommended to use `self.log('valid_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
  warning_cache.warn(
2022-08-09 05:20:06.635795: F ./tensorflow/core/tpu/tpu_executor_init_fns.inc:148] TpuCompiler_DefaultDeviceShapeRepresentation not available in this library.
https://symbolize.stripped_domain/r/?trace=7f7d6063103b,7f7d606310bf,7f7a6166d64c,7f7a61ae72b3,7f7d609f3b89&map= 
*** SIGABRT received by PID 13038 (TID 13038) on cpu 74 from PID 13038; stack trace: ***
PC: @     0x7f7d6063103b  (unknown)  raise
    @     0x7f7c56dd6cda        992  (unknown)
    @     0x7f7d606310c0  (unknown)  (unknown)
    @     0x7f7a6166d64d        416  tensorflow::tpu::(anonymous namespace)::(anonymous namespace)::SetExecutorStructFn()
    @     0x7f7a61ae72b4        544  tensorflow::tpu::(anonymous namespace)::FindAndLoadTpuLibrary()
    @     0x7f7d609f3b8a  (unknown)  (unknown)
https://symbolize.stripped_domain/r/?trace=7f7d6063103b,7f7c56dd6cd9,7f7d606310bf,7f7a6166d64c,7f7a61ae72b3,7f7d609f3b89&map=50c831e765011c7eb7163b7f3cb5c4b6:7f7c4862c000-7f7c57144f00 

To Reproduce

  1. Spin up a TPU machine with torch 1.11
  2. Install the latest PL version
  3. Run the bug_report_model.py script

Expected behavior

The script should run without the compiler error (it works with torch 1.10).

Environment

  • Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
  • PyTorch Lightning Version (e.g., 1.5.0):
  • Lightning App Version (e.g., 0.5.2):
  • PyTorch Version (e.g., 1.10):
  • Python version (e.g., 3.9):
  • OS (e.g., Linux):
  • CUDA/cuDNN version:
  • GPU models and configuration:
  • How you installed PyTorch (conda, pip, source):
  • If compiling from source, the output of torch.__config__.show():
  • Running environment of LightningApp (e.g. local, cloud):
  • Any other relevant information:

Additional context

cc @tchaton @rohitgr7 @kaushikb11

kaushikb11 avatar Aug 09 '22 05:08 kaushikb11

I'm hitting the same issue with our Lightning Trainer, but it fails even before training starts. It also works with the TPU VM image for PyTorch 1.10.

jhoareau avatar Aug 31 '22 11:08 jhoareau

Hi @jhoareau!

This is likely a libtpu version mismatch. Could you try the steps below?

sudo rm -rf /usr/local/lib/python3.8/dist-packages/libtpu*
sudo pip3 install 'torch_xla[tpuvm]'

kaushikb11 avatar Sep 05 '22 09:09 kaushikb11
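
To check for the suspected version mismatch before and after reinstalling, it can help to print which versions are actually installed. The following is a minimal sketch using only the standard library. The distribution names in CANDIDATES (in particular libtpu-nightly) are assumptions, not taken from this thread, so adjust them to whatever `pip3 list` shows on your TPU VM.

```python
from importlib import metadata

# Assumed distribution names; adjust to match `pip3 list` on your machine.
CANDIDATES = ("torch", "torch-xla", "libtpu-nightly", "pytorch-lightning")


def installed_versions(names=CANDIDATES):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

If torch and torch-xla report different minor versions, or libtpu shows a build that predates the torch_xla install, that would be consistent with the mismatch described above. The script only reports versions and does not confirm the cause on its own.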