flair
Automated Concatenation of Embeddings
Adds ACE, based on this repo: https://github.com/Alibaba-NLP/ACE. It is currently not tested on larger training runs, but it at least runs through when using CONLL_03 downsampled to 0.005 on each dataset.
Inner working:
While the original implementation masks out embeddings, I decided to precompute them once at the start and create several models with different concatenations. As StackedEmbeddings overwrites the embedding name, I cannot use it, so AceEmbeddings acts as a dummy that loads the precomputed embeddings with the right size. For each "episode", a new model is created and initialized with the parameters of the previous models. Only the input weight is modified, so that it uses just the nodes connected to the respective embeddings. For example, if you trained with a full embedding of size 7500, the first layer will be of size 7500*x. If the second training run only uses embeddings of total size 3750, only the corresponding part of the first layer's weight is used (3750*x).
At the end, a final model is created and returned by the trainer.train method.
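The weight reuse described above can be illustrated with a minimal sketch. This is not the code from this PR: the helper names `embedding_columns` and `slice_input_weight` are hypothetical, and plain Python lists stand in for the tensors of the first layer. It only shows the idea of keeping the input columns that belong to the selected embeddings.

```python
from typing import Dict, List, Sequence


def embedding_columns(sizes: Dict[str, int], selected: Sequence[str]) -> List[int]:
    """Return the input-column indices that belong to the selected embeddings.

    `sizes` maps embedding name -> vector length, in concatenation order.
    """
    columns: List[int] = []
    offset = 0
    for name, size in sizes.items():
        if name in selected:
            columns.extend(range(offset, offset + size))
        offset += size
    return columns


def slice_input_weight(weight: List[List[float]], columns: Sequence[int]) -> List[List[float]]:
    """Keep only the given input columns of an (out_features x in_features) matrix."""
    return [[row[c] for c in columns] for row in weight]


# Hypothetical toy setup: three embeddings with a total input size of 2 + 3 + 1 = 6.
sizes = {"glove": 2, "flair-forward": 3, "bert": 1}
full_weight = [[float(10 * r + c) for c in range(6)] for r in range(2)]  # 2 x 6

# An episode that drops "flair-forward" only needs columns 0, 1 and 5.
columns = embedding_columns(sizes, selected=["glove", "bert"])
sub_weight = slice_input_weight(full_weight, columns)
print(columns)     # [0, 1, 5]
print(sub_weight)  # [[0.0, 1.0, 5.0], [10.0, 11.0, 15.0]]
```

All other parameters of the model would be copied unchanged; only the first layer needs this column selection.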
Usage on CONLL:
Currently, I use the following for training (apart from downsampling the CONLL dataset). If someone has the time and resources to try to reproduce the actual ACE results and share them, that would be very welcome :-)
from flair.datasets import CONLL_03
from flair.embeddings import FlairEmbeddings, TransformerWordEmbeddings, WordEmbeddings
from flair.trainers.ace_trainer import AceTrainer

# candidate embeddings; ACE searches for the best concatenation of these
embeddings = [
    WordEmbeddings("glove"),
    FlairEmbeddings("news-forward"),
    FlairEmbeddings("news-backward"),
    WordEmbeddings("crawl"),
    TransformerWordEmbeddings("bert-base-cased", layers="-1,-2,-3,-4", layer_mean=True),
    TransformerWordEmbeddings(
        "bert-base-cased", layers="-1,-2,-3,-4", use_context=True, name="bert-base-cased-context", layer_mean=True
    ),
    TransformerWordEmbeddings("xlm-roberta-base", layers="-1,-2,-3,-4", layer_mean=True),
    TransformerWordEmbeddings(
        "xlm-roberta-base", layers="-1,-2,-3,-4", use_context=True, name="xlm-roberta-base-context", layer_mean=True
    ),
]
corpus = CONLL_03()
label_type = "ner"
label_dict = corpus.make_label_dictionary(label_type=label_type)
trainer = AceTrainer(
    corpus,
    embeddings=embeddings,
    model_args=dict(
        hidden_size=800, tag_dictionary=label_dict, tag_type=label_type, use_crf=True, reproject_embeddings=2048
    ),
)
# inner_train_args are the settings used for the training run of each episode
model, history = trainer.train(
    "resources/ace",
    inner_train_args=dict(learning_rate=0.1, mini_batch_size=32, max_epochs=150),
)
considerations
The way it is currently implemented, the embeddings are expected to be precomputable and stored in CPU RAM. This leads to memory problems (for some strange reason also with my GPU RAM). I will be testing with only half of the CONLL_03 training set.
I am currently not sure how useful this is in practice. It is neat if you have a lot of RAM and time: you can fine-tune plenty of FLERT-like models (BERT, RoBERTa, DeBERTa, XLM-RoBERTa, etc.) and then try to find the best concatenation of those. The authors mention using 42 GPU hours for training, and prediction speed will likely be low when using several transformers at once, but maybe there are some use cases where the raw accuracy matters more.
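To get a feeling for the RAM requirement mentioned above, here is a rough back-of-the-envelope sketch. The token count is a hypothetical round number (not a measured corpus statistic), the embedding size of 7500 is the figure from the example in the description, and float32 storage without any overhead is assumed.

```python
def precomputed_embedding_bytes(num_tokens: int, embedding_dim: int, bytes_per_value: int = 4) -> int:
    """Memory needed to keep one dense vector per token (float32 by default)."""
    return num_tokens * embedding_dim * bytes_per_value


num_tokens = 200_000   # hypothetical corpus size, assumption for illustration
embedding_dim = 7_500  # full concatenation size used as an example in the description

total_bytes = precomputed_embedding_bytes(num_tokens, embedding_dim)
print(f"{total_bytes / 1e9:.1f} GB")  # 6.0 GB
```

Under these assumptions, halving the training set halves the footprint, which is consistent with the plan to test on half of the data.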
other changes
- nicer names for some embeddings. WordEmbeddings & FlairEmbeddings had the full path as names. Now it is the file name without the extension.
- reintroduced an integer parameter for reproject_embeddings in the SequenceTagger. I don't know why it was removed, but I wanted to use it, as the dynamic size would mess up the way I reuse the parameters from model to model.
- training won't anneal on the dev set when param_selection_mode is activated (or when no dev set is provided)
@helpmefindaname this is awesome, thanks a lot for adding this!
@lukasgarbas can you review this PR and try to reproduce the ACE results?
Hi @helpmefindaname , many thanks for adding this! I'm also going to test it with HIPE-2022 and our recently released BERT models for historic texts. I'm very excited for the results :)
I think a good setup would also be to use mean of all layers of a BERT model (this should at least decrease dimension size from 4 x hidden size to 1 x hidden size)
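The size reduction can be checked with a small arithmetic sketch. The helper `transformer_embedding_dim` is hypothetical and not part of flair. It assumes that the selected layers are concatenated when layer_mean is off and averaged when it is on, and it uses 768 as the hidden size of bert-base-cased.

```python
def transformer_embedding_dim(hidden_size: int, num_layers: int, layer_mean: bool) -> int:
    """Per-token vector size: selected layers are averaged (mean) or concatenated."""
    return hidden_size if layer_mean else hidden_size * num_layers


hidden_size = 768  # hidden size of bert-base-cased

concat_dim = transformer_embedding_dim(hidden_size, num_layers=4, layer_mean=False)
mean_dim = transformer_embedding_dim(hidden_size, num_layers=4, layer_mean=True)
print(concat_dim, mean_dim)  # 3072 768
```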
Hi @stefan-it thanks, I am also excited to hear about it.
It's important to note that the authors of ACE achieved the best results by concatenating transformer models that were already fine-tuned on the dataset beforehand, while my code example uses only the standard models.
Thanks for pointing out the mean, I corrected the example.
Hi, sorry for getting back so late! I'm currently running some smaller experiments and looking at the code. First, I ran your code snippet on the full CoNLL03 dataset. Here is what the complete snippet looks like:
import flair
import torch

from flair.embeddings import WordEmbeddings, FlairEmbeddings, TransformerWordEmbeddings
from flair.trainers.ace_trainer import AceTrainer
from flair.datasets import CONLL_03

# run on the second GPU
flair.device = torch.device('cuda:1')

embeddings = [
    WordEmbeddings("glove"),
    FlairEmbeddings("news-forward"),
    FlairEmbeddings("news-backward"),
    WordEmbeddings("crawl"),
    TransformerWordEmbeddings("bert-base-cased", layers="all", layer_mean=True),
    TransformerWordEmbeddings(
        "bert-base-cased", layers="all", use_context=True, name="bert-base-cased-context", layer_mean=True
    ),
    TransformerWordEmbeddings("xlm-roberta-base", layers="all", layer_mean=True),
    TransformerWordEmbeddings(
        "xlm-roberta-base", layers="all", use_context=True, name="xlm-roberta-base-context", layer_mean=True
    ),
]
corpus = CONLL_03('')
label_type = "ner"
label_dict = corpus.make_label_dictionary(label_type=label_type)
trainer = AceTrainer(
    corpus,
    embeddings=embeddings,
    model_args=dict(
        hidden_size=800, tag_dictionary=label_dict, tag_type=label_type, use_crf=True, reproject_embeddings=2048
    ),
)
model, history = trainer.train(
    "resources/ace",
    inner_train_args=dict(learning_rate=0.1, mini_batch_size=32, max_epochs=150),
)
It took 26 hours to complete all 30 episodes on a single RTX 3090 GPU. The best-scoring episode was the first one (using all embeddings), with 92.8 on the test set. However, the best concatenation picked by ACE at the end of training (all embeddings except XLM-R base without context) scored only 92.13 on the test set. Here is my training log if you want to take a look.
I am now running one more experiment using large fine-tuned models!
One question on final model vs. best model evaluation: it looks like each episode is evaluated using the final model state. Is there an option to evaluate using best models?
Hi @lukasgarbas ,
the best model is not stored because of the following logic in the base trainer class:
https://github.com/flairNLP/flair/blob/3ba253cd7b81ecdd5d023c17b9c0d9ca1bd27219/flair/trainers/trainer.py#L792-L806
The option that prevents saving the best model is param_selection_mode, because it is set to True in the ACE trainer implementation:
https://github.com/helpmefindaname/flair/blob/23ad8e2dc4f5ba2bed22704d15d369baaadd63c4/flair/trainers/ace_trainer.py#L284
Hi @lukasgarbas thank you for testing it!
I noticed one implementation detail where I deviated from the original: I made it impossible to use the same configuration twice, while the original only ensures that the configuration is not the same as the previous one. They then take the maximum score of all runs with the same configuration. A fix is already pushed. Could you try that one out?
As stefan-it already mentioned, using the best model is not possible. Besides that, looking at the logs, the last model trained has the highest dev score (0.9607) while having a lower test score (0.9271) than the first episode. So I suppose that trying that would lead to overfitting. I added an evaluation of the final model on the dev set, so we can also compare that one.
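The bookkeeping described above (a configuration may repeat as long as it differs from the previous one, and the maximum score per configuration is kept) could look roughly like the following sketch. It is not the code from this PR or from the original ACE repo. The names `is_allowed` and `record_score` are hypothetical, the scores are made up for illustration, and rejecting the empty selection is an additional assumption.

```python
from typing import Dict, Optional, Tuple

# One boolean per candidate embedding: True means "included in the concatenation".
Config = Tuple[bool, ...]


def is_allowed(config: Config, previous: Optional[Config]) -> bool:
    """A sampled configuration only has to differ from the previous one.

    Assumption: a configuration that selects no embedding at all is rejected.
    """
    return any(config) and config != previous


def record_score(best: Dict[Config, float], config: Config, score: float) -> None:
    """Keep the maximum score over all runs that used the same configuration."""
    best[config] = max(score, best.get(config, float("-inf")))


best_scores: Dict[Config, float] = {}
# Hypothetical (configuration, dev score) pairs; the first configuration repeats.
runs = [
    ((True, True, False), 0.90),
    ((True, False, False), 0.92),
    ((True, True, False), 0.91),
]

previous: Optional[Config] = None
for config, score in runs:
    if is_allowed(config, previous):
        record_score(best_scores, config, score)
        previous = config

best_config = max(best_scores, key=best_scores.get)
print(best_scores[(True, True, False)])  # 0.91
print(best_config)                       # (True, False, False)
```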
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hello @alanakbik @helpmefindaname, is this PR dropped? It looks like the implementation was almost over and it was a really good idea to integrate it into flair!