
OOM

Open RoboticFreeze opened this issue 4 years ago • 8 comments

I get this error no matter what batch_size I use:

01/24/2021 13:24:37 — INFO — aitextgen.TokenDataset — Encoding 3,164 sets of tokens from marian.txt.
GPU available: True, used: True
01/24/2021 13:24:37 — INFO — lightning — GPU available: True, used: True
No environment variable for node rank defined. Set as 0.
01/24/2021 13:24:37 — WARNING — lightning — No environment variable for node rank defined. Set as 0.
CUDA_VISIBLE_DEVICES: [0]
01/24/2021 13:24:37 — INFO — lightning — CUDA_VISIBLE_DEVICES: [0]

0% 0/20000 [00:00<?, ?it/s]

RuntimeError                              Traceback (most recent call last)
in ()
      6     save_gdrive=False,
      7     learning_rate=1e-4,
----> 8     batch_size=32,
      9 )

21 frames
/usr/local/lib/python3.6/dist-packages/transformers/modeling_utils.py in forward(self, x)
   1710     def forward(self, x):
   1711         size_out = x.size()[:-1] + (self.nf,)
-> 1712         x = torch.addmm(self.bias, x.view(-1, x.size(-1)), self.weight)
   1713         x = x.view(*size_out)
   1714         return x

RuntimeError: CUDA out of memory. Tried to allocate 12.00 MiB (GPU 0; 11.17 GiB total capacity; 10.71 GiB already allocated; 8.81 MiB free; 10.71 GiB reserved in total by PyTorch)
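The usual workaround for this kind of OOM during fine-tuning is to lower the per-step memory footprint: reduce `batch_size` (which, per the traceback, is a parameter of aitextgen's `train()` call) and recover the same effective batch via gradient accumulation. A minimal sketch of the accounting; `effective_batch` is a hypothetical helper for illustration, not part of aitextgen:

```python
def effective_batch(micro_batch_size: int, accumulation_steps: int) -> int:
    """Gradients from several small forward/backward passes are summed
    before a single optimizer step, so the optimizer effectively sees
    the product of the two values."""
    return micro_batch_size * accumulation_steps

# A micro-batch of 4 accumulated over 8 steps trains like batch_size=32,
# but only 4 samples' activations occupy GPU memory at any one time.
print(effective_batch(4, 8))  # → 32
```

Whether this helps depends on where the memory goes: activation memory scales with the micro-batch size, but model weights, gradients, and optimizer state do not, so a model that is simply too large for the GPU will OOM even at batch_size=1.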

RoboticFreeze avatar Jan 24 '21 13:01 RoboticFreeze

I am getting this same error too.

redthing1 avatar Feb 04 '21 06:02 redthing1

This appears to solve the problem?

redthing1 avatar Feb 04 '21 07:02 redthing1

Is this using the 124M GPT-2?

minimaxir avatar Feb 11 '21 03:02 minimaxir

Yes.

redthing1 avatar Feb 11 '21 04:02 redthing1

Also, as I said in a previous comment, replacing the dependencies section at the top of the notebook with the one in this comment resolves the issue: https://github.com/minimaxir/aitextgen/issues/87#issuecomment-770431967. So it's probably a dependency problem somewhere, and the Colab fine-tune notebook template needs to be updated to fix it.

redthing1 avatar Feb 11 '21 04:02 redthing1

Try again using 0.4.0; there are some underlying changes to transformers/pytorch-lightning that might handle this better.

minimaxir avatar Feb 23 '21 05:02 minimaxir

Looks like this is resolved!

redthing1 avatar Apr 22 '21 23:04 redthing1

Yeah, I keep encountering this error. My dataset isn't large (~20,000 characters), so I don't know exactly why this is happening. If there is a way I can obtain logs, I'd be more than happy to drop them here.

Edit: I compared the 124M model against the 355M model. With the 124M model, training initiated successfully; with the 355M model, I got the RuntimeError: CUDA out of memory. Is this a new restriction on free-tier Google Colab users? I do notice that the error has also popped up before with the 124M model. It would be interesting for others to try the 355M model and see whether an OOM error is raised. I was able to reproduce this on aitextgen as well as on two different versions of the other Colab notebook utilizing gpt-2-simple.
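The 124M-vs-355M gap is consistent with a back-of-envelope memory estimate: training in fp32 with Adam keeps the weights, the gradients, and two moment buffers resident, roughly 16 bytes per parameter before any activations. A rough sketch (parameter counts are nominal, and activation memory, which scales with batch size and sequence length, comes on top):

```python
def adam_fp32_bytes(n_params: int) -> int:
    """fp32 weights + gradients + Adam's two moment buffers = 16 bytes/param.
    Activations are NOT included and grow with batch size."""
    return n_params * (4 + 4 + 4 + 4)

for name, n in [("124M", 124_000_000), ("355M", 355_000_000)]:
    print(f"GPT-2 {name}: ~{adam_fp32_bytes(n) / 2**30:.1f} GiB before activations")
# → ~1.8 GiB for 124M and ~5.3 GiB for 355M
```

On the ~11 GiB GPUs common on the free Colab tier (as in the error message above), ~5.3 GiB of fixed state plus activations leaves little headroom for the 355M model, whereas the 124M model fits comfortably.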

Mennaruuk avatar Mar 08 '22 23:03 Mennaruuk