trax
LSH Reformer with multiple hashes not possible on TPU
Description
When training a Reformer with LSH attention and n_hashes > 1 on a TPU, training hangs and the trainer never completes even a single training step.
Environment information
Google Colab VM
Steps to reproduce
Open, for example, this Colab: https://colab.research.google.com/github/google/trax/blob/master/trax/models/reformer/text_generation.ipynb
Set LSH n_hashes to 2 in the gin config and leave everything else as is.
Set the accelerator to TPU and attempt to train one step; training gets stuck.
Set the accelerator to GPU and retry; training runs normally.
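The n_hashes change in the steps above can also be applied programmatically to the gin config text before it is parsed. This is only a minimal sketch: the helper `set_n_hashes` and the binding name `HypotheticalLayer` are made up for illustration, and the real binding name must be taken from the notebook's own gin config.

```python
import re

# Matches a gin binding such as "SomeLayer.n_hashes = 1" at the start of a line.
_N_HASHES = re.compile(r"^(\s*[\w.]+\.n_hashes\s*=\s*)\d+", re.MULTILINE)


def set_n_hashes(config_text: str, n_hashes: int) -> str:
    """Return config_text with every `<name>.n_hashes = <int>` binding set to n_hashes.

    Hypothetical helper, not part of trax or gin.
    """
    new_text, count = _N_HASHES.subn(
        lambda m: f"{m.group(1)}{n_hashes}", config_text
    )
    if count == 0:
        raise ValueError("no n_hashes binding found in the config text")
    return new_text


# Illustrative config fragment; the layer name is a placeholder.
example_config = "HypotheticalLayer.n_hashes = 1\nHypotheticalLayer.dropout = 0.0\n"
patched_config = set_n_hashes(example_config, 2)
```

The patched string can then be passed to whatever gin parsing call the notebook already uses.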
I've also hit a similar issue when running the Reformer on a Colab TPU with this gin config, which also seems to use n_hashes > 1: https://github.com/google/trax/blob/master/trax/supervised/configs/reformer_imagenet64.gin
This seems to be a TPU driver bug (I don't have details on that).
I managed to fix the problem by requesting a different version of tpu_driver, so in your notebook you should change:
url = 'http://' + os.environ['COLAB_TPU_ADDR'].split(':')[0] + ':8475/requestversion/tpu_driver0.1-dev20191206'
to
url = 'http://' + os.environ['COLAB_TPU_ADDR'].split(':')[0] + ':8475/requestversion/tpu_driver_nightly'
in the first cell of your colab. Hope that solves your problem too.
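For reference, the driver request with the changed URL might look like the sketch below. The function names are made up for illustration, and the sketch assumes the endpoint accepts an empty POST request; the URL format and the `COLAB_TPU_ADDR` variable are taken from the lines quoted above. Whether `tpu_driver_nightly` keeps working depends on the Colab TPU runtime.

```python
import os
import urllib.request


def driver_request_url(colab_tpu_addr: str, version: str = "tpu_driver_nightly") -> str:
    """Build the requestversion URL from a COLAB_TPU_ADDR value such as 'host:port'."""
    host = colab_tpu_addr.split(":")[0]
    return f"http://{host}:8475/requestversion/{version}"


def request_driver(version: str = "tpu_driver_nightly") -> int:
    """Ask the Colab TPU host for the given driver version; returns the HTTP status.

    Assumption: the endpoint accepts a POST with an empty body.
    """
    url = driver_request_url(os.environ["COLAB_TPU_ADDR"], version)
    req = urllib.request.Request(url, data=b"", method="POST")
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.status


# Only contact the TPU host when running on a Colab TPU runtime.
if "COLAB_TPU_ADDR" in os.environ:
    print(request_driver())
```

Passing `"tpu_driver0.1-dev20191206"` as `version` reproduces the original request, which makes it easy to switch back and compare behaviour.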