[Bug]: CUDA OOM when training NER models with Transformer embeddings and BiLSTM-CRF
Describe the bug
I keep running into CUDA OOM errors when training NER models using Transformer embeddings and a BiLSTM-CRF. The trainer can't even get through one epoch, which is strange because I have used the same script with much bigger datasets and it always worked.
I put the training part of my script in the To Reproduce section and the error log in Logs and Stack traces. I launch the script with:
python flair_bilstm_crf_cv.py --embed_type trans --embed_path "camembert-base" --load_dates --learning_rate 0.025 --min_learning_rate 0.005 --mini_batch_size 4
My corpus has a train:dev:test split of 5.8k : 300 : 300.
I tried with three models (camembert-base, bert-multilingual-uncased and xlm-roberta-base) and always hit the same problem. I also set max_split_size_mb to 128 MB, as suggested in the error message.
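For reference, that allocator option is read from the PYTORCH_CUDA_ALLOC_CONF environment variable, so it needs to be in place before the script first touches the GPU. A minimal sketch of setting it at the very top of the training script (exporting it in the shell before launching works just as well):

import os
# set before any tensor is moved to the GPU so the CUDA caching allocator picks it up
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"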
This has been bugging me for weeks :( I would really appreciate any help from people who have encountered this before.
Thanks in advance!
To Reproduce
from flair.data import Corpus
from flair.datasets.sequence_labeling import MultiFileColumnCorpus
from flair.embeddings import TransformerWordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer
import os
import argparse
import torch


def net_tag(tag):
    if tag in ['0', '<unk>', '_']:
        return 'O'
    else:
        return tag


parser = argparse.ArgumentParser()
parser.add_argument("--embed_path", type=str, help="embedding file")
parser.add_argument("--model_path", type=str, help="model file", default="")
parser.add_argument("--load_davinci", help="load davinci file", action='store_true')
parser.add_argument("--load_dates", help="load altered dates file", action='store_true')
parser.add_argument("--learning_rate", type=float, help="initial learning rate", default=0.1)
parser.add_argument("--min_learning_rate", type=float, help="minimum learning rate", default=0.0001)
parser.add_argument("--starting_at_fold", type=int, help="index of fold to start with (e.g. in case of previous interrupted training)", default=0)
parser.add_argument("--max_epochs", type=int, help="max epochs", default=100)
parser.add_argument("--embed_type", type=str, help="embedding type")
parser.add_argument("--mini_batch_size", type=int, help="mini batch size", default=8)
parser.add_argument("--continue_training", help="if continue training a model", action='store_true')
args = parser.parse_args()

if not args.model_path:
    model_path = args.embed_path
else:
    model_path = args.model_path

if not args.continue_training:
    if args.load_davinci:
        model_path += '_davinci'
    if args.load_dates:
        model_path += "_dates"
    model_path += '_bilstm_crf_cv'

embeddings = TransformerWordEmbeddings(args.embed_path,
                                       layers='-1,-2,-3,-4',
                                       layer_mean=True,
                                       subtoken_pooling='mean',
                                       is_document_embedding=False,
                                       force_device='cpu')

columns = {0: 'text', 1: 'pos', 2: 'ner'}
# data_folder, train_files, dev and test are defined in the data-loading part of the script, omitted here
corpus: Corpus = MultiFileColumnCorpus(columns,
                                       train_files=train_files,
                                       test_files=[os.path.join(data_folder, test)],
                                       dev_files=[os.path.join(data_folder, dev)],
                                       in_memory=False)
gold_corpus: Corpus = MultiFileColumnCorpus(columns, test_files=['./data/gold_2011.conll'], in_memory=False)
y_test = [[net_tag(token.get_labels()[-1].value) for token in sentence] for sentence in corpus.test]

# tag to predict
label_type = 'ner'
# make tag dictionary from the corpus
tag_dictionary = corpus.make_label_dictionary(label_type=label_type)

if not args.continue_training:
    tagger: SequenceTagger = SequenceTagger(hidden_size=512,
                                            embeddings=embeddings,
                                            tag_dictionary=tag_dictionary,
                                            tag_type=label_type,
                                            use_crf=True)
else:
    tagger: SequenceTagger = SequenceTagger.load(os.path.join(args.model_path, 'best-model.pt'))

trainer: ModelTrainer = ModelTrainer(tagger, corpus)

if args.continue_training:
    trainer_path = args.model_path
else:
    trainer_path = model_path + '_'

trainer.train(trainer_path,
              learning_rate=args.learning_rate,
              mini_batch_size=args.mini_batch_size,
              mini_batch_chunk_size=2,
              max_epochs=args.max_epochs,
              min_learning_rate=args.min_learning_rate,
              save_final_model=False,
              exclude_labels=['<unk>'])

del tagger
del trainer
torch.cuda.empty_cache()
Expected behavior
It's supposed to just train, I guess? Maybe taking more time since I reduced many parameter values, but it should work.
Logs and Stack traces
2023-09-28 15:29:33,151 epoch 1 - iter 1015/1457 - loss 0.09511951 - samples/sec: 4.57 - lr: 0.025000
2023-09-28 15:31:19,328 epoch 1 - iter 1160/1457 - loss 0.09109652 - samples/sec: 5.66 - lr: 0.025000
2023-09-28 15:33:06,055 epoch 1 - iter 1305/1457 - loss 0.08588031 - samples/sec: 5.61 - lr: 0.025000
Traceback (most recent call last):
File "flair_bilstm_crf_cv.py", line 173, in <module>
trainer.train(trainer_path,
File "/share/home/cao/.conda/envs/flair_trans/lib/python3.8/site-packages/flair/trainers/trainer.py", line 500, in train
loss = self.model.forward_loss(batch_step)
File "/share/home/cao/.conda/envs/flair_trans/lib/python3.8/site-packages/flair/models/sequence_tagger_model.py", line 270, in forward_loss
scores, gold_labels = self.forward(sentences) # type: ignore
File "/share/home/cao/.conda/envs/flair_trans/lib/python3.8/site-packages/flair/models/sequence_tagger_model.py", line 282, in forward
self.embeddings.embed(sentences)
File "/share/home/cao/.conda/envs/flair_trans/lib/python3.8/site-packages/flair/embeddings/base.py", line 62, in embed
self._add_embeddings_internal(data_points)
File "/share/home/cao/.conda/envs/flair_trans/lib/python3.8/site-packages/flair/embeddings/base.py", line 766, in _add_embeddings_internal
self._add_embeddings_to_sentences(expanded_sentences)
File "/share/home/cao/.conda/envs/flair_trans/lib/python3.8/site-packages/flair/embeddings/base.py", line 692, in _add_embeddings_to_sentences
hidden_states = self.model(input_ids, **model_kwargs)[-1]
File "/share/home/cao/.conda/envs/flair_trans/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/share/home/cao/.conda/envs/flair_trans/lib/python3.8/site-packages/transformers/models/camembert/modeling_camembert.py", line 903, in forward
encoder_outputs = self.encoder(
File "/share/home/cao/.conda/envs/flair_trans/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/share/home/cao/.conda/envs/flair_trans/lib/python3.8/site-packages/transformers/models/camembert/modeling_camembert.py", line 540, in forward
layer_outputs = layer_module(
File "/share/home/cao/.conda/envs/flair_trans/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/share/home/cao/.conda/envs/flair_trans/lib/python3.8/site-packages/transformers/models/camembert/modeling_camembert.py", line 467, in forward
layer_output = apply_chunking_to_forward(
File "/share/home/cao/.conda/envs/flair_trans/lib/python3.8/site-packages/transformers/pytorch_utils.py", line 249, in apply_chunking_to_forward
return forward_fn(*input_tensors)
File "/share/home/cao/.conda/envs/flair_trans/lib/python3.8/site-packages/transformers/models/camembert/modeling_camembert.py", line 480, in feed_forward_chunk
layer_output = self.output(intermediate_output, attention_output)
File "/share/home/cao/.conda/envs/flair_trans/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/share/home/cao/.conda/envs/flair_trans/lib/python3.8/site-packages/transformers/models/camembert/modeling_camembert.py", line 392, in forward
hidden_states = self.dropout(hidden_states)
File "/share/home/cao/.conda/envs/flair_trans/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/share/home/cao/.conda/envs/flair_trans/lib/python3.8/site-packages/torch/nn/modules/dropout.py", line 59, in forward
return F.dropout(input, self.p, self.training, self.inplace)
File "/share/home/cao/.conda/envs/flair_trans/lib/python3.8/site-packages/torch/nn/functional.py", line 1252, in dropout
return _VF.dropout_(input, p, training) if inplace else _VF.dropout(input, p, training)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 110.00 MiB (GPU 0; 47.54 GiB total capacity; 45.52 GiB already allocated; 77.12 MiB free; 46.22 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Screenshots
No response
Additional Context
No response
Environment
Versions:
Flair: 0.11.3
PyTorch: 1.13.1
Transformers: 4.26.1
GPU: True
Hi @DanrunFR, can you check the token length of the sentences in your dataset? If you have a few very long sentences, you might not be able to use them for training.
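A minimal sketch for checking this, assuming the corpus object from your script above:

# tokens per sentence in the training split; a handful of very long sentences
# can blow up transformer memory even with a small mini-batch size
train_lengths = sorted(len(corpus.train[i]) for i in range(len(corpus.train)))
print("10 longest training sentences (in tokens):", train_lengths[-10:])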
Besides that, I would recommend either fine-tuning the transformer embeddings or using a BiLSTM, but not both, as that combination usually doesn't work too well. You can simply pass fine_tune=False to the transformer embeddings, so the BiLSTM gets a stable input and you will save some GPU memory.
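A minimal sketch of how that could look, keeping the other embedding options from your script:

from flair.embeddings import TransformerWordEmbeddings

# same embedding setup as in the script above, but with the transformer frozen
embeddings = TransformerWordEmbeddings('camembert-base',
                                       layers='-1,-2,-3,-4',
                                       layer_mean=True,
                                       subtoken_pooling='mean',
                                       fine_tune=False)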
Hi, thank you for your response! Since this is part of a comparative study of embedding methods, I can't really modify the training set, so I'll have to find a way to make it work even with a couple of long sentences. I did not know, though, that transformer models are fine-tuned by default even with a BiLSTM on top. I'll try passing the fine_tune parameter and see how it goes!