
OOM

Open RoboticFreeze opened this issue 4 years ago • 8 comments

I get this error no matter what batch_size I use:

01/24/2021 13:24:37 — INFO — aitextgen.TokenDataset — Encoding 3,164 sets of tokens from marian.txt.
GPU available: True, used: True
01/24/2021 13:24:37 — INFO — lightning — GPU available: True, used: True
No environment variable for node rank defined. Set as 0.
01/24/2021 13:24:37 — WARNING — lightning — No environment variable for node rank defined. Set as 0.
CUDA_VISIBLE_DEVICES: [0]
01/24/2021 13:24:37 — INFO — lightning — CUDA_VISIBLE_DEVICES: [0]

0% 0/20000 [00:00<?, ?it/s]

RuntimeError                              Traceback (most recent call last)
in ()
      6     save_gdrive=False,
      7     learning_rate=1e-4,
----> 8     batch_size=32,
      9 )

21 frames
/usr/local/lib/python3.6/dist-packages/transformers/modeling_utils.py in forward(self, x)
   1710     def forward(self, x):
   1711         size_out = x.size()[:-1] + (self.nf,)
-> 1712         x = torch.addmm(self.bias, x.view(-1, x.size(-1)), self.weight)
   1713         x = x.view(*size_out)
   1714         return x

RuntimeError: CUDA out of memory. Tried to allocate 12.00 MiB (GPU 0; 11.17 GiB total capacity; 10.71 GiB already allocated; 8.81 MiB free; 10.71 GiB reserved in total by PyTorch)
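The usual workaround for this kind of OOM during fine-tuning is to lower the per-step memory footprint: reduce `batch_size` (which, per the traceback, is a parameter of aitextgen's `train()` call) and recover the same effective batch via gradient accumulation. A minimal sketch of the accounting; `effective_batch` is a hypothetical helper for illustration, not part of aitextgen:

```python
def effective_batch(micro_batch_size: int, accumulation_steps: int) -> int:
    """Gradients from several small forward/backward passes are summed
    before a single optimizer step, so the optimizer effectively sees
    the product of the two values."""
    return micro_batch_size * accumulation_steps

# A micro-batch of 4 accumulated over 8 steps trains like batch_size=32,
# but only 4 samples' activations occupy GPU memory at any one time.
print(effective_batch(4, 8))  # → 32
```

Whether this helps depends on where the memory goes: activation memory scales with the micro-batch size, but model weights, gradients, and optimizer state do not, so a model that is simply too large for the GPU will OOM even at batch_size=1.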

RoboticFreeze avatar Jan 24 '21 13:01 RoboticFreeze

I am getting this same error too.

redthing1 avatar Feb 04 '21 06:02 redthing1

This appears to solve the problem?

redthing1 avatar Feb 04 '21 07:02 redthing1

Is this using the 124M GPT-2?

minimaxir avatar Feb 11 '21 03:02 minimaxir

Yes.

redthing1 avatar Feb 11 '21 04:02 redthing1

Also, as I said in a previous comment, replacing the dependencies section at the top of the notebook with the one in this comment resolves the issue: https://github.com/minimaxir/aitextgen/issues/87#issuecomment-770431967. So it's probably a dependency problem somewhere, and the Colab fine-tune notebook template needs to be updated to fix it.

redthing1 avatar Feb 11 '21 04:02 redthing1

Try again using 0.4.0; there are some underlying changes to transformers/pytorch-lightning that might handle this better.

minimaxir avatar Feb 23 '21 05:02 minimaxir

Looks like this is resolved!

redthing1 avatar Apr 22 '21 23:04 redthing1

Yeah, I keep encountering this error. My dataset isn't large (~20,000 characters), so I don't know exactly why this is happening. If there is a way I can obtain logs, I'd be more than happy to drop them here.

Edit: I compared the 124M model against the 355M model. With the 124M model, training initiated successfully; with the 355M model, I got the RuntimeError: CUDA out of memory. Is this a new restriction on free-tier Google Colab users? I do notice that the error has also popped up before with the 124M model. It would be interesting for others to try the 355M model and see whether an OOM error is raised. I was able to reproduce this on aitextgen as well as on two different versions of the other Colab notebook utilizing gpt-2-simple.
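The 124M-vs-355M gap is consistent with a back-of-envelope memory estimate: training in fp32 with Adam keeps the weights, the gradients, and two moment buffers resident, roughly 16 bytes per parameter before any activations. A rough sketch (parameter counts are nominal, and activation memory, which scales with batch size and sequence length, comes on top):

```python
def adam_fp32_bytes(n_params: int) -> int:
    """fp32 weights + gradients + Adam's two moment buffers = 16 bytes/param.
    Activations are NOT included and grow with batch size."""
    return n_params * (4 + 4 + 4 + 4)

for name, n in [("124M", 124_000_000), ("355M", 355_000_000)]:
    print(f"GPT-2 {name}: ~{adam_fp32_bytes(n) / 2**30:.1f} GiB before activations")
# → ~1.8 GiB for 124M and ~5.3 GiB for 355M
```

On the ~11 GiB GPUs common on the free Colab tier (as in the error message above), ~5.3 GiB of fixed state plus activations leaves little headroom for the 355M model, whereas the 124M model fits comfortably.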

Mennaruuk avatar Mar 08 '22 23:03 Mennaruuk