
Automated Concatenation of Embeddings

Open helpmefindaname opened this issue 2 years ago • 7 comments

Adds ACE, from this repo: https://github.com/Alibaba-NLP/ACE. Currently not tested on larger training runs, but it at least runs through when using CONLL_03 downsampled to 0.005 on each split.
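
For reference, a quick smoke test like the one described can be set up roughly like this (a minimal sketch; the exact downsampling call isn't shown in the PR, and the downsample signature varies slightly across flair versions):

from flair.datasets import CONLL_03

# keep ~0.5% of each split for a quick smoke test
corpus = CONLL_03().downsample(0.005)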

Inner workings:

While the original masks out embeddings, I decided to precompute them once at the start and create several models with different concatenations. As StackedEmbeddings overwrites the embedding name, I cannot use it, so AceEmbeddings acts as a dummy that loads the precomputed embeddings with the right size. For each "episode", a new model is created and initialized with the parameters of the previous models. Only the input weight is modified so that it uses just the nodes connected to the respective embeddings. (E.g. if the full embedding has size 7500, the first layer has size 7500*x. If the second training only uses embeddings of size 3750, only the corresponding part of the first layer's weight is used (3750*x).) At the end, a final model is created and returned by the trainer.train method.
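
As an illustration of the weight-reuse step, here is a minimal sketch (sizes and names are illustrative, not the PR's actual code): the trained first-layer weight of the full model is sliced column-wise, keeping only the columns that belong to the embeddings selected for the next episode.

import torch

hidden = 800
emb_lengths = [100, 2048, 2048, 300, 768, 768, 768, 768]  # per-embedding widths (illustrative)
full_weight = torch.randn(hidden, sum(emb_lengths))       # stands in for the trained episode-0 weight

selected = [0, 1, 2, 3]  # e.g. keep glove, both flair embeddings, and crawl

offsets = [0]
for length in emb_lengths:
    offsets.append(offsets[-1] + length)

# gather the input columns belonging to the selected embeddings
columns = torch.cat([torch.arange(offsets[i], offsets[i + 1]) for i in selected])
new_weight = full_weight[:, columns]  # initializes the smaller model's first layer
print(new_weight.shape)  # torch.Size([800, 4496])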

Usage on CONLL:

Currently, I use the following for training (besides downsampling the CONLL dataset). If someone has the time and resources to try to reproduce the actual ACE results and share them, that would be very welcome :-)

from flair.datasets import CONLL_03
from flair.embeddings import WordEmbeddings, FlairEmbeddings, TransformerWordEmbeddings
from flair.trainers.ace_trainer import AceTrainer

embeddings = [
    WordEmbeddings("glove"),
    FlairEmbeddings("news-forward"),
    FlairEmbeddings("news-backward"),
    WordEmbeddings("crawl"),
    TransformerWordEmbeddings("bert-base-cased", layers="-1,-2,-3,-4", layer_mean=True),
    TransformerWordEmbeddings(
        "bert-base-cased", layers="-1,-2,-3,-4", use_context=True, name="bert-base-cased-context", layer_mean=True
    ),
    TransformerWordEmbeddings("xlm-roberta-base", layers="-1,-2,-3,-4", layer_mean=True),
    TransformerWordEmbeddings(
        "xlm-roberta-base", layers="-1,-2,-3,-4", use_context=True, name="xlm-roberta-base-context", layer_mean=True
    ),
]
corpus = CONLL_03()
label_type = "ner"
label_dict = corpus.make_label_dictionary(label_type=label_type)

trainer = AceTrainer(
    corpus,
    embeddings=embeddings,
    model_args=dict(
        hidden_size=800, tag_dictionary=label_dict, tag_type=label_type, use_crf=True, reproject_embeddings=2048
    ),
)

model, history = trainer.train(
    "resources/ace", inner_train_args=dict(learning_rate=0.1, mini_batch_size=32, max_epochs=150)
)

Considerations:

The way it is currently implemented, the embeddings are expected to be precomputable and stored in CPU RAM. This leads to problems (for strange reasons, also with my GPU RAM). I will be testing using only half of the training set of CONLL_03.

I am currently not sure how useful this is in practice. If you have a lot of RAM and time, it is neat that you can fine-tune plenty of FLERT-like models (BERT, RoBERTa, DeBERTa, XLM-RoBERTa, etc.) and then try to find the best concatenation of those. The authors mentioned using 42 GPU-hours for training, and prediction speed will likely be low when using several transformers at once, but maybe there are some use cases where the straight-up accuracy is more beneficial.

Other changes:

  • nicer names for some embeddings: WordEmbeddings & FlairEmbeddings had the full path as their name; now it is the filename without the extension.
  • reintroduced an integer parameter for reproject_embeddings in the SequenceTagger (a short sketch follows this list). I don't know why it was removed, but I wanted to use it, as a dynamic reprojection size would mess up the way I reuse the parameters from model to model.
  • training won't anneal on the dev set when param_selection_mode is activated (or when no dev set is provided)
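
Regarding the integer reproject_embeddings parameter, a rough sketch of the idea (assumed, simplified behavior, not the actual SequenceTagger code): with a fixed integer, the reprojection output size, and therefore every downstream layer, stays constant across episodes; only the reprojection layer's input size changes with the chosen concatenation.

import torch.nn as nn

embedding_length = 7500      # varies with the chosen concatenation
reproject_embeddings = 2048  # fixed int -> constant downstream layer sizes

if isinstance(reproject_embeddings, bool):
    # True: reproject to the embedding's own (dynamic) length
    reprojection = nn.Linear(embedding_length, embedding_length)
else:
    # int: reproject to a fixed size, independent of the concatenation
    reprojection = nn.Linear(embedding_length, reproject_embeddings)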

helpmefindaname avatar May 22 '22 19:05 helpmefindaname

@helpmefindaname this is awesome, thanks a lot for adding this!

@lukasgarbas can you review this PR and try to reproduce the ACE results?

alanakbik avatar May 24 '22 09:05 alanakbik

Hi @helpmefindaname , many thanks for adding this! I'm also going to test it with HIPE-2022 and our recently released BERT models for historic texts. I'm very excited for the results :)

stefan-it avatar May 25 '22 06:05 stefan-it

I think a good setup would also be to use the mean of all layers of a BERT model (this should at least decrease the dimension from 4 x hidden size to 1 x hidden size)

stefan-it avatar May 25 '22 07:05 stefan-it

Hi @stefan-it, thanks! I am also excited to hear about it.

It's important to note that the authors of ACE achieved their best results by concatenating transformer models that were already fine-tuned on the dataset beforehand, while my code example only uses the standard models.
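
For illustration, fine-tuned checkpoints could be plugged into the same embedding list, since TransformerWordEmbeddings also accepts a local model path (the paths below are hypothetical):

from flair.embeddings import TransformerWordEmbeddings

# hypothetical paths to checkpoints fine-tuned on the target dataset beforehand
embeddings = [
    TransformerWordEmbeddings("resources/bert-base-cased-finetuned-conll03", layers="all", layer_mean=True),
    TransformerWordEmbeddings("resources/xlm-roberta-base-finetuned-conll03", layers="all", layer_mean=True),
]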

Thanks for pointing out the mean, I corrected the example.

helpmefindaname avatar May 25 '22 16:05 helpmefindaname

Hi, sorry for getting back so late! I'm currently running some smaller experiments and looking at the code. First, I ran your code snippet on the full CoNLL03 dataset. Here is what the complete snippet looks like:

import torch

import flair
from flair.datasets import CONLL_03
from flair.embeddings import WordEmbeddings, FlairEmbeddings, TransformerWordEmbeddings
from flair.trainers.ace_trainer import AceTrainer

flair.device = torch.device('cuda:1')

embeddings = [
    WordEmbeddings("glove"),
    FlairEmbeddings("news-forward"),
    FlairEmbeddings("news-backward"),
    WordEmbeddings("crawl"),
    TransformerWordEmbeddings("bert-base-cased", layers="all", layer_mean=True),
    TransformerWordEmbeddings(
        "bert-base-cased", layers="all", use_context=True, name="bert-base-cased-context", layer_mean=True
    ),
    TransformerWordEmbeddings("xlm-roberta-base", layers="all", layer_mean=True),
    TransformerWordEmbeddings(
        "xlm-roberta-base", layers="all", use_context=True, name="xlm-roberta-base-context", layer_mean=True
    ),
]

corpus = CONLL_03('')
label_type = "ner"
label_dict = corpus.make_label_dictionary(label_type=label_type)

trainer = AceTrainer(
    corpus,
    embeddings=embeddings,
    model_args=dict(
        hidden_size=800, tag_dictionary=label_dict, tag_type=label_type, use_crf=True, reproject_embeddings=2048
    ),
)

model, history = trainer.train(
    "resources/ace", inner_train_args=dict(learning_rate=0.1, mini_batch_size=32, max_epochs=150)
)

It took 26 hours to complete all 30 episodes on a single RTX 3090 GPU. The best-scoring episode was the first one (using all embeddings), with 92.8 on the test set. However, the best concatenation picked by ACE at the end of training (all embeddings except XLM-R base without context) scored only 92.13 on the test set. Here is my training log if you want to take a look.

I am now running one more experiment using large fine-tuned models!

One question on final-model vs. best-model evaluation: it looks like each episode is evaluated using the final model state. Is there an option to evaluate using the best model instead?

lukasgarbas avatar Jun 24 '22 08:06 lukasgarbas

Hi @lukasgarbas ,

Storing of the best model is not performed because of the following logic in the base trainer class:

https://github.com/flairNLP/flair/blob/3ba253cd7b81ecdd5d023c17b9c0d9ca1bd27219/flair/trainers/trainer.py#L792-L806

The option that prevents saving the best model is param_selection_mode, which is set to True in the ACE trainer implementation:

https://github.com/helpmefindaname/flair/blob/23ad8e2dc4f5ba2bed22704d15d369baaadd63c4/flair/trainers/ace_trainer.py#L284

stefan-it avatar Jun 24 '22 09:06 stefan-it

Hi @lukasgarbas thank you for testing it!

I noticed one implementation detail where I deviated from the original: I made it impossible to use the same configuration twice, while the original only ensures that a configuration differs from the immediately preceding one, and then takes the maximum score over all runs with the same configuration. A fix is already pushed. Could you try it out?
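
To make the difference concrete, here is an illustrative sketch (not the PR's actual code) of the original's bookkeeping: a configuration may be revisited as long as it differs from the immediately preceding one, and its reported score is the maximum over all runs that used it.

from collections import defaultdict

best_score_per_config = defaultdict(float)

def record_run(selection: tuple, dev_score: float) -> None:
    # selection could be a tuple of 0/1 flags, one per candidate embedding
    best_score_per_config[selection] = max(best_score_per_config[selection], dev_score)

def is_allowed(selection: tuple, previous_selection: tuple) -> bool:
    # the original only forbids repeating the immediately preceding configuration
    return selection != previous_selection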

Like stefan-it already mentioned, using the best model is not possible. Besides that, looking at the logs, the last model trained has the highest dev score (0.9607) while having a lower test score (0.9271) than the first episode, so I suppose evaluating on the best model would lead to overfitting. I added evaluation of the final model on the dev set, so we can compare that one as well.

helpmefindaname avatar Jun 24 '22 15:06 helpmefindaname

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Oct 29 '22 02:10 stale[bot]

Hello @alanakbik @helpmefindaname, has this PR been dropped? It looks like the implementation was almost finished, and it would be a really good idea to integrate it into flair!

mauryaland avatar Jun 22 '23 13:06 mauryaland