TPU UserWarning: "The model layers do not match after moving to the target device"; loading a saved model also does not work.
I slightly modified the Colab notebook from https://github.com/minimaxir/aitextgen/pull/105#issuecomment-812918241 (my version: https://colab.research.google.com/drive/1PpkQuZUEC42NfQhXN_EVrfxuDtzksFSI?usp=sharing). Training works: 8 TPU cores reach about 2 it/s on a ~115M-parameter model, a single TPU core runs at roughly 20 s/it with the same model config, and a T4 GPU reaches 1.05 it/s. However, when I load my model from a local directory (actually from Google Drive, but that should not matter), aitextgen starts downloading a different GPT model instead. If I create a fresh model, everything looks fine, apart from the � characters in the generated text (possibly caused by my bad vocab). If I then start training, the output looks like this:
06/15/2021 07:38:22 — INFO — aitextgen.TokenDataset — Encoding 24,334 sets of tokens from /content/drive/MyDrive/Loll/text2.txt.
06/15/2021 07:38:36 — INFO — pytorch_lightning.utilities.distributed — GPU available: False, used: False
06/15/2021 07:38:36 — INFO — pytorch_lightning.utilities.distributed — TPU available: True, using: 8 TPU cores
06/15/2021 07:38:36 — INFO — pytorch_lightning.utilities.distributed — IPU available: False, using: 0 IPUs
/usr/local/lib/python3.7/dist-packages/pytorch_lightning/utilities/distributed.py:69: UserWarning: The model layers do not match after moving to the target device. If your model employs weight sharing on TPU, please tie your weights using the `on_post_move_to_device` model hook.
Layer count: [Before: 196 After: 197]
warnings.warn(*args, **kwargs)
06/15/2021 07:39:48 — INFO — pytorch_lightning.utilities.distributed — Restored all states from the checkpoint file at None
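For context, the warning points at PyTorch Lightning's `on_post_move_to_device` model hook, which is meant to re-tie shared weights after the model is moved to the TPU. A minimal sketch of what that hook would look like (illustrative names, not aitextgen's actual model class):

```python
import torch.nn as nn


class TiedModel(nn.Module):
    """Minimal sketch of GPT-style weight sharing plus the re-tying hook."""

    def __init__(self, vocab_size: int = 100, hidden: int = 32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.head = nn.Linear(hidden, vocab_size, bias=False)
        # Weight tying: the output head shares the embedding matrix.
        self.head.weight = self.embed.weight

    def on_post_move_to_device(self):
        # PyTorch Lightning calls this hook after moving the model to the
        # target device (e.g. TPU); re-tie the weights that XLA may have
        # untied during the move, which would explain the layer-count
        # mismatch (196 vs. 197) in the warning above.
        self.head.weight = self.embed.weight
```

If aitextgen's underlying GPT-2 module does not implement this hook, the untied extra tensor might also explain the slightly larger checkpoint size mentioned below.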
Also, if I load the saved model on a GPU, it works: the model can generate text. One more thing: the TPU-trained model is 485 MB, while a model built with the same config on a GPU is 482 MB.