fairseq

Getting CUDA error when trying to train MBART Model

Open · LabComputerResearch opened this issue 2 years ago · 1 comment


from transformers import (
    MBart50TokenizerFast,
    MBartForConditionalGeneration,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
)

# Language codes omitted here; src_lang / tgt_lang are set to my language pair.
tokenizer = MBart50TokenizerFast.from_pretrained(
    "facebook/mbart-large-50-many-to-many-mmt", src_lang="", tgt_lang="")
model = MBartForConditionalGeneration.from_pretrained(
    "facebook/mbart-large-50-many-to-many-mmt")

batch_size = 8

args = Seq2SeqTrainingArguments(
    output_dir="./resultsMBart",
    evaluation_strategy="epoch",
    learning_rate=3e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    save_total_limit=3,
    num_train_epochs=3,
    predict_with_generate=True,
    fp16=False,
    report_to="none",
)

# tokenized_datasets, data_collator and compute_metrics are defined earlier in my script.
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
RuntimeError: CUDA out of memory. Tried to allocate 978.00 MiB (GPU 0; 15.74 GiB total capacity; 13.76 GiB already allocated; 351.00 MiB free; 14.02 GiB reserved in total by PyTorch)
If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
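
For reference, the max_split_size_mb hint in the last two lines is configured through the PYTORCH_CUDA_ALLOC_CONF environment variable. A minimal sketch, assuming it is set before any CUDA memory is allocated; the 128 MB split size is only an illustrative value:

import os

# The caching allocator reads PYTORCH_CUDA_ALLOC_CONF lazily, so set it
# before importing torch (exporting it in the shell also works).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch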

I have recently started working in NLP and was trying to fine-tune an MBART model on my data set, but every time I start training I get a CUDA error. I have tried decreasing the batch size as well as killing all processes on the GPU, but I cannot seem to find a solution. Would anyone have an idea of how I could fix this and train the model? The data set I am using has approximately 2 million sentences, but that didn't cause a problem when I tried other models, so I have no idea why this is occurring; any help would be appreciated. The GPU I am using is an NVIDIA Quadro RTX 5000 (16 GB).

LabComputerResearch avatar Oct 19 '22 12:10 LabComputerResearch

Hi,

I guess mBART-large is not going to fit in 16 GB of GPU memory. Here are a few options you might try:

  • Try mBART base rather than the large model.
  • You did not mention your input and output sequence lengths; long sequences also drive up memory, so consider truncating them to a shorter maximum length.
  • Your batch size is 8; reduce it to 4 or 2 (see the sketch after this list).
  • Try torch.cuda.empty_cache() in your training loop if you are not using the Trainer.
  • Try clearing your Anaconda or Hugging Face cache on your server or machine (I know that sounds odd, but it worked for me and I don't know the exact reason).
  • If you have access to multiple GPUs, allocate them and parallelize the model across them.
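
To make the batch-size and sequence-length points concrete, here is a minimal sketch; the numbers (batch size 2, max_length 128) are illustrative assumptions rather than tuned values, and src_texts / tgt_texts stand in for your own lists of sentences:

from transformers import Seq2SeqTrainingArguments

# Same setup as in the issue, but with a smaller per-device batch size.
args = Seq2SeqTrainingArguments(
    output_dir="./resultsMBart",
    evaluation_strategy="epoch",
    learning_rate=3e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    save_total_limit=3,
    num_train_epochs=3,
    predict_with_generate=True,
    fp16=False,
    report_to="none",
)

# Cap the source/target lengths at tokenization time; 128 is just an example,
# and src_texts / tgt_texts are hypothetical lists of raw source/target
# sentences (text_target needs a reasonably recent transformers version).
model_inputs = tokenizer(
    src_texts,
    text_target=tgt_texts,
    max_length=128,
    truncation=True,
)

Shorter sequences and a smaller batch both reduce activation memory, which is the part you can still control once the model weights and optimizer states are loaded.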

I hope this will help you.

MehwishFatimah avatar Oct 19 '22 13:10 MehwishFatimah