
Colab: OOM when fine-tuning GPT-Neo (both 125M and 350M) on both T4 and P100

redthing1 opened this issue 3 years ago · 5 comments

Using Colab, I get an OOM error when fine-tuning GPT-Neo (both 125M and 350M) on both T4 and P100 GPUs. The problem persists even when I enable fp16. GPT-2, on the other hand, works fine.

/usr/local/lib/python3.7/dist-packages/transformers/models/gpt_neo/modeling_gpt_neo.py in _attn(self, query, key, value, causal_mask, masked_bias, attn_dropout, attention_mask, head_mask)
    235 
    236         attn_weights = torch.matmul(query, key.transpose(-1, -2))
--> 237         attn_weights = torch.where(causal_mask, attn_weights, masked_bias.to(attn_weights.dtype))
    238 
    239         if attention_mask is not None:

RuntimeError: CUDA out of memory. Tried to allocate 192.00 MiB (GPU 0; 15.90 GiB total capacity; 14.85 GiB already allocated; 61.75 MiB free; 14.96 GiB reserved in total by PyTorch)
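
For context, this is roughly the training call that produces the error above. It's a minimal sketch following the standard aitextgen Colab workflow, so the model name, file name, and parameter values are illustrative rather than my exact settings:

```python
from aitextgen import aitextgen

# Load the pretrained GPT-Neo weights onto the GPU (same pattern for 125M and 350M).
ai = aitextgen(model="EleutherAI/gpt-neo-350M", to_gpu=True)

# Fine-tune on a plain-text file; even with fp16 enabled the OOM above still occurs.
ai.train(
    "dataset.txt",
    line_by_line=False,
    num_steps=3000,
    batch_size=1,
    fp16=True,
)
```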


redthing1 · Apr 22 '21

That's weird. Are you changing any other training settings?

minimaxir · Apr 28 '21

> That's weird. Are you changing any other training settings?

Everything else is at the defaults. I tried again using a fresh copy of your notebook; 125M now works, but 350M still OOMs.
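
In case it helps anyone reproduce this, I've been checking GPU memory between attempts with plain PyTorch calls (a small diagnostic snippet, nothing aitextgen-specific):

```python
import gc
import torch

# Drop references to any previous aitextgen/model objects first, then
# release PyTorch's cached blocks so the next attempt starts clean.
gc.collect()
torch.cuda.empty_cache()

# Report how much of the card is actually in use before training starts.
print(f"allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"reserved:  {torch.cuda.memory_reserved() / 1e9:.2f} GB")
```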

redthing1 · Apr 29 '21

Having this issue too. If it matters, I'm using a pretty large text file (~20 MB) as the dataset, and I'm also getting this warning a short while after training starts:

Token indices sequence length is longer than the specified maximum sequence length for this model (2385 > 2048). Running this sequence through the model will result in indexing errors

This also happened in my attempts to train GPT-Neo locally, so it doesn't seem to be specific to Colab.

johnnymcmike · Jul 26 '21

> Having this issue too. If it matters, I'm using a pretty large text file (~20 MB) as the dataset, and I'm also getting this warning a short while after training starts:
>
> Token indices sequence length is longer than the specified maximum sequence length for this model (2385 > 2048). Running this sequence through the model will result in indexing errors
>
> This also happened in my attempts to train GPT-Neo locally, so it doesn't seem to be specific to Colab.

That warning just means one of your training samples has a token count larger than the model's 2048-token maximum context; it's not the same thing as a GPU OOM. I recommend running the tokenizer over your dataset to find whichever sequence is causing it.
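
Something like this will flag the offending samples. It's a rough sketch that tokenizes the dataset line by line with the Hugging Face tokenizer; the file name and the one-sample-per-line assumption are illustrative:

```python
from transformers import AutoTokenizer

# GPT-Neo's context window is 2048 tokens; anything longer triggers that warning.
MAX_LEN = 2048
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125M")

with open("dataset.txt", encoding="utf-8") as f:
    for i, line in enumerate(f):
        n_tokens = len(tokenizer(line)["input_ids"])
        if n_tokens > MAX_LEN:
            print(f"line {i}: {n_tokens} tokens (exceeds {MAX_LEN})")
```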

redthing1 · Aug 27 '21

Alright, I'll check that out, but I am also definitely OOMing.

johnnymcmike · Aug 28 '21