                        OOM
I get this error no matter what batch_size I use:
01/24/2021 13:24:37 — INFO — aitextgen.TokenDataset — Encoding 3,164 sets of tokens from marian.txt.
GPU available: True, used: True
01/24/2021 13:24:37 — INFO — lightning — GPU available: True, used: True
No environment variable for node rank defined. Set as 0.
01/24/2021 13:24:37 — WARNING — lightning — No environment variable for node rank defined. Set as 0.
CUDA_VISIBLE_DEVICES: [0]
01/24/2021 13:24:37 — INFO — lightning — CUDA_VISIBLE_DEVICES: [0]
0% 0/20000 [00:00<?, ?it/s]
RuntimeError                              Traceback (most recent call last)
21 frames
/usr/local/lib/python3.6/dist-packages/transformers/modeling_utils.py in forward(self, x)
   1710     def forward(self, x):
   1711         size_out = x.size()[:-1] + (self.nf,)
-> 1712         x = torch.addmm(self.bias, x.view(-1, x.size(-1)), self.weight)
   1713         x = x.view(*size_out)
   1714         return x
RuntimeError: CUDA out of memory. Tried to allocate 12.00 MiB (GPU 0; 11.17 GiB total capacity; 10.71 GiB already allocated; 8.81 MiB free; 10.71 GiB reserved in total by PyTorch)
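For context, the failing call is just the notebook's training cell. A minimal sketch of the usual first mitigation for CUDA OOM (the file name matches the log above, but the parameter values here are illustrative, not the notebook's defaults):

```python
from aitextgen import aitextgen

# Load the default 124M GPT-2 and move it to the GPU.
ai = aitextgen(tf_gpt2="124M", to_gpu=True)

# Illustrative values: batch_size=1 minimizes peak GPU memory, and
# gradient_accumulation_steps=4 keeps the effective batch size at 4.
ai.train(
    "marian.txt",
    num_steps=20000,
    batch_size=1,
    gradient_accumulation_steps=4,
)
```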
I am getting this same error too.
This appears to solve the problem?
Is this using the 124M GPT-2?
Yes.
Also, as I said in a previous comment, replacing the dependencies section at the top of the notebook with the one from this comment (https://github.com/minimaxir/aitextgen/issues/87#issuecomment-770431967) resolves the issue. So it is probably a dependency problem somewhere, and the Colab fine-tune notebook template needs to be updated to fix it.
Try again using 0.4.0; there are some underlying changes to transformers/pytorch-lightning that might handle this better.
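In a Colab notebook that amounts to reinstalling the package and restarting the runtime, e.g. (a sketch; any extra pins the notebook needs may differ):

```python
# Reinstall in a Colab cell, then restart the runtime so the newer
# transformers/pytorch-lightning dependencies are picked up.
!pip install -q -U aitextgen==0.4.0
```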
Looks like this is resolved!
Yeah, I keep encountering this error. I don't have a crazy number of characters (around 20,000), so I don't know exactly why this is happening. If there is a way I can obtain logs to drop here, I'd be more than happy to do so.
Edit: I compared the 124M model against the 355M model. With the 124M model, training initiated successfully; with the 355M model, I got the RuntimeError: CUDA out of memory. Is this a new restriction on free-tier Google Colab users? I do notice that the error popped up before with the 124M model as well. It would be interesting for others to try the 355M model and see whether an OOM error is raised. I was able to reproduce this with aitextgen as well as with two different versions of the other Colab notebook that uses gpt-2-simple.
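A minimal way for others to reproduce the comparison (a sketch: input.txt is a placeholder for your own training file, and the parameter values are illustrative):

```python
from aitextgen import aitextgen

# "124M" trains on a free-tier Colab GPU (~11 GiB, per the log above);
# swapping in "355M" is what triggers the CUDA OOM described here.
ai = aitextgen(tf_gpt2="355M", to_gpu=True)

# input.txt is a placeholder file name.
ai.train("input.txt", num_steps=1000, batch_size=1)
```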