Fine-tuning or extended training on the target language in NER few-shot transfer?
Firstly, I would like to thank everyone contributing to this easy-to-use, well-structured and open-source framework. I am currently writing my bachelor thesis and have been using the flair framework daily over the last weeks, trying out different languages in zero- and few-shot transfer in order to figure out what does or does not hinder knowledge transfer on the NER task. Please excuse me if my questions seem self-explanatory to you, since I am a novice in this particular field.
The code below is the training I have implemented for all languages (all derived from the WikiANN dataset, downsampled to 20k/10k/10k train, dev and test sets respectively, where applicable). I have tried out both approaches from the FLERT paper and stuck with the second one (feature-based), based on my results and issue #2732.
Training on the source language (here English):
# import libs
from flair.data import Corpus
from flair.datasets import ColumnCorpus
from flair.embeddings import TransformerWordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer
# set language abbreviation
lan = "en"
# path to dataset splits
data_folder = f"/content/gdrive/MyDrive/data/{lan}"
# set column scheme
columns = {0: "text", 1: "ner"}
# create the corpus
corpus: Corpus = ColumnCorpus(data_folder, columns,
                              train_file="train.txt",
                              test_file="test.txt",
                              dev_file="dev.txt",
                              tag_to_bioes=None,
                              column_delimiter="\t",
                              comment_symbol="#")
# label to predict
label_type = 'ner'
# make the label dictionary from the corpus
label_dict = corpus.make_label_dictionary(label_type=label_type)
# initialize non-fine-tuneable transformer embeddings
embeddings = TransformerWordEmbeddings(model='xlm-roberta-base',
                                       layers="all",
                                       subtoken_pooling="first",
                                       fine_tune=False,
                                       use_context=True)
# initialize sequence tagger
tagger = SequenceTagger(
    embeddings=embeddings,
    tag_dictionary=label_dict,
    tag_type=label_type,
    use_rnn=True,
    hidden_size=256,
    rnn_layers=1,
    use_crf=True,
    reproject_embeddings=True,
)
# initialize trainer
trainer = ModelTrainer(tagger, corpus)
# run training
trainer.train(f'/content/gdrive/MyDrive/models/resources/taggers/{lan}/{lan}_tagger',
              learning_rate=0.1,
              mini_batch_size=32,
              max_epochs=300,
              checkpoint=True,
              embeddings_storage_mode="gpu",
              write_weights=True)
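For reference, the first (fine-tuning) FLERT approach that I tried before settling on the feature-based setup follows the standard flair recipe roughly like this. This is only a sketch, reusing lan, corpus, label_dict and label_type from above; the output path and hyperparameters are examples, not my exact script:
# fine-tuneable transformer embeddings, last layer only, with document context
ft_embeddings = TransformerWordEmbeddings(model='xlm-roberta-base',
                                          layers="-1",
                                          subtoken_pooling="first",
                                          fine_tune=True,
                                          use_context=True)
# bare-bones tagger: no RNN, no CRF, no reprojection
ft_tagger = SequenceTagger(
    embeddings=ft_embeddings,
    tag_dictionary=label_dict,
    tag_type=label_type,
    use_rnn=False,
    use_crf=False,
    reproject_embeddings=False,
)
ft_trainer = ModelTrainer(ft_tagger, corpus)
# fine-tune with a small learning rate and small mini-batches, as in the FLERT setup
ft_trainer.fine_tune(f'/content/gdrive/MyDrive/models/resources/taggers/{lan}/{lan}_finetuned_tagger',
                     learning_rate=5e-5,
                     mini_batch_size=4,
                     max_epochs=10)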
Question 1 on few-shot:
Now that my source language model is trained, would it be more logical to continue training on the target language with the previously used default parameters (i.e. trainer.train()), or to switch to the trainer.fine_tune() default parameters? Or would the knowledge already acquired from the source language be overwritten in the first approach (.train())? I have tried out both, but would nevertheless very much appreciate any opinions on this matter.
Extended training on the target language (here Tamil with 50 sentences as training set):
# import libs
from flair.data import Corpus
from flair.datasets import ColumnCorpus
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer
# set abbreviations of source and target language
source_lan = "en"
target_lan = "ta"
# set column scheme
columns = {0: "text", 1: "ner"}
# path to dataset splits
data_folder = f"/content/gdrive/MyDrive/data/{target_lan}"
# create the corpus
corpus: Corpus = ColumnCorpus(data_folder, columns,
                              train_file="train.txt",
                              test_file="test.txt",
                              dev_file="dev.txt",
                              tag_to_bioes=None,
                              column_delimiter="\t",
                              comment_symbol="#")
# load the trained source-language model to continue training on the target language
tagger: SequenceTagger = SequenceTagger.load(f'/content/gdrive/MyDrive/models/resources/taggers/{source_lan}/{source_lan}_tagger/best-model.pt')
# initialize trainer
trainer = ModelTrainer(tagger, corpus)
# run training
trainer.train(f'/content/gdrive/MyDrive/models/resources/taggers/{target_lan}/train_{source_lan}_to_{target_lan}_tagger',
              learning_rate=0.1,
              mini_batch_size=32,
              max_epochs=300,
              checkpoint=True,
              embeddings_storage_mode="gpu")
Fine-tuning on the target language (here Tamil with 50 sentences as training set):
# run fine-tuning
trainer.fine_tune(f'/content/gdrive/MyDrive/models/resources/taggers/{target_lan}/fine_tune_{source_lan}_to_{target_lan}_tagger',
                  learning_rate=5e-5,
                  mini_batch_size=4,
                  max_epochs=10,  # also tried with 20, just like in the FLERT paper
                  embeddings_storage_mode="gpu")
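For completeness, the pure zero-shot baseline (no target-language training at all) can be measured by evaluating a freshly loaded copy of the source model directly on the target test split, before running either of the trainings above. A minimal sketch, reusing the corpus object and model path from above, and assuming a recent flair version whose evaluate() accepts gold_label_type:
# load the source model again (untouched by any target-language training)
zero_shot_tagger: SequenceTagger = SequenceTagger.load(
    f'/content/gdrive/MyDrive/models/resources/taggers/{source_lan}/{source_lan}_tagger/best-model.pt')
# evaluate it directly on the target test set
result = zero_shot_tagger.evaluate(corpus.test,
                                   gold_label_type="ner",
                                   mini_batch_size=32)
print(result.detailed_results)  # per-class precision / recall / F1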
Question 2 on few-shot:
Since I will be experimenting with different training set sizes for the target languages (50, 100, 500, 1000 sentences) to show the performance improvement with more samples, would you recommend a fixed dev and test set size, and if so, what would be a reasonable value (e.g. 2.8k sentences in each of the dev and test sets every time)?
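A simple way to vary only the training size while keeping dev.txt and test.txt identical across runs is to rewrite just train.txt for each experiment. A rough data-prep sketch in plain Python, assuming WikiANN-style files with blank-line-separated sentences; train_full.txt is just a placeholder name for the full downloaded split:
import random

def write_train_subset(full_train_path: str, out_path: str, n_sentences: int, seed: int = 42):
    """Sample n_sentences blank-line-separated sentences from a CoNLL-style file."""
    with open(full_train_path, encoding="utf-8") as f:
        sentences = [block for block in f.read().strip().split("\n\n") if block.strip()]
    random.Random(seed).shuffle(sentences)
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n\n".join(sentences[:n_sentences]) + "\n")

# e.g. create the 50-sentence few-shot training file for Tamil;
# dev.txt and test.txt stay untouched across all runs
write_train_subset("/content/gdrive/MyDrive/data/ta/train_full.txt",
                   "/content/gdrive/MyDrive/data/ta/train.txt",
                   n_sentences=50)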
Question 3 on few-shot:
Any recommendations on literature for few-shot techniques?
Hello @PartGi, thanks for using Flair! To your questions:
- This is hard to answer, but training with a learning rate of 0.1 risks overwriting too much, while fine-tuning with a tiny learning rate may not be enough to adapt an RNN to a new language (where word order might be different). I would probably try .train() but set a smaller initial learning rate, like 0.05 or so.
- I would definitely fix the test set and make sure that the exact same sentences are in the test set across all experiments. A few thousand sentences should be enough for testing. In our experiments for the TARS paper, we used a dev set of the same size as the train set, so 50 train sentences meant 50 dev sentences.
- You could check out our TARS-related work. We have few-shot and zero-shot NER models implemented in Flair as well: https://github.com/flairNLP/flair/blob/master/resources/docs/TUTORIAL_10_TRAINING_ZERO_SHOT_MODEL.md
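To illustrate the linked tutorial: loading the pretrained TARS NER model and tagging with arbitrary label names, without any training data, looks roughly like this (a minimal sketch along the lines of the tutorial; the task name and label names are just examples):
from flair.data import Sentence
from flair.models import TARSTagger

# load the pretrained zero-shot NER tagger
tars = TARSTagger.load('tars-ner')

# define the entity classes to predict (no training data needed)
labels = ["Person", "Location", "Organization"]
tars.add_and_switch_to_new_task('wikiann zero-shot', labels, label_type='ner')

# tag a sentence with these labels and print the result
sentence = Sentence("George Washington went to Washington.")
tars.predict(sentence)
print(sentence.to_tagged_string("ner"))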
@alanakbik very much appreciated!
FYI, or for others who want to try out the same few-shot transfer: extended training on the target language with a learning rate of 0.1 outperformed models trained with a learning rate of either 0.05 or 0.02.
Cool, thanks for the info!