fairseq

Getting CUDA error when trying to train MBART Model

Open · LabComputerResearch opened this issue 2 years ago · 1 comment


from transformers import (
    MBart50TokenizerFast,
    MBartForConditionalGeneration,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
)

# Language codes omitted here; src_lang / tgt_lang are set to my language pair.
tokenizer = MBart50TokenizerFast.from_pretrained(
    "facebook/mbart-large-50-many-to-many-mmt", src_lang="", tgt_lang="")
model = MBartForConditionalGeneration.from_pretrained(
    "facebook/mbart-large-50-many-to-many-mmt")

batch_size = 8

args = Seq2SeqTrainingArguments(
    output_dir="./resultsMBart",
    evaluation_strategy="epoch",
    learning_rate=3e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    save_total_limit=3,
    num_train_epochs=3,
    predict_with_generate=True,
    fp16=False,
    report_to="none",
)

# tokenized_datasets, data_collator and compute_metrics are defined earlier in my script.
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
RuntimeError: CUDA out of memory. Tried to allocate 978.00 MiB (GPU 0; 15.74 GiB total capacity; 13.76 GiB already allocated; 351.00 MiB free; 14.02 GiB reserved in total by PyTorch)
If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
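
For reference, the max_split_size_mb hint in the last two lines is configured through the PYTORCH_CUDA_ALLOC_CONF environment variable. A minimal sketch, assuming it is set before any CUDA memory is allocated; the 128 MB split size is only an illustrative value:

import os

# The caching allocator reads PYTORCH_CUDA_ALLOC_CONF lazily, so set it
# before importing torch (exporting it in the shell also works).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch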

I have recently started working in NLP and was trying to fine-tune an MBART model on my data set, but every time I start training I get a CUDA error. I have tried decreasing the batch size as well as killing all processes on the GPU, but I cannot seem to find a solution. Would anyone have an idea of how I could fix this and train the model? The data set I am using has approximately 2 million sentences, but that didn't cause a problem when I tried other models, so I have no idea why this is occurring; any help would be appreciated. The GPU I am using is an NVIDIA Quadro RTX 5000 (16 GB).

LabComputerResearch avatar Oct 19 '22 12:10 LabComputerResearch

Hi,

I guess mBART-large is not going to fit in 16 GB of GPU memory. Here are a few options you might try:

  • Try mBART base rather than the large model.
  • You did not mention your input and output sequence lengths; long sequences also drive up memory, so consider truncating them to a shorter maximum length.
  • Your batch size is 8; reduce it to 4 or 2 (see the sketch after this list).
  • Try torch.cuda.empty_cache() in your training loop if you are not using the Trainer.
  • Try clearing your Anaconda or Hugging Face cache on your server or machine (I know that sounds odd, but it worked for me and I don't know the exact reason).
  • If you have access to multiple GPUs, allocate them and parallelize the model across them.
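
To make the batch-size and sequence-length points concrete, here is a minimal sketch; the numbers (batch size 2, max_length 128) are illustrative assumptions rather than tuned values, and src_texts / tgt_texts stand in for your own lists of sentences:

from transformers import Seq2SeqTrainingArguments

# Same setup as in the issue, but with a smaller per-device batch size.
args = Seq2SeqTrainingArguments(
    output_dir="./resultsMBart",
    evaluation_strategy="epoch",
    learning_rate=3e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    save_total_limit=3,
    num_train_epochs=3,
    predict_with_generate=True,
    fp16=False,
    report_to="none",
)

# Cap the source/target lengths at tokenization time; 128 is just an example,
# and src_texts / tgt_texts are hypothetical lists of raw source/target
# sentences (text_target needs a reasonably recent transformers version).
model_inputs = tokenizer(
    src_texts,
    text_target=tgt_texts,
    max_length=128,
    truncation=True,
)

Shorter sequences and a smaller batch both reduce activation memory, which is the part you can still control once the model weights and optimizer states are loaded.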

I hope this will help you.

MehwishFatimah avatar Oct 19 '22 13:10 MehwishFatimah