COVID-QA
Benchmark & improve different embedding models
Plain transformer models (like BERT) are known to produce poor sentence embeddings. In a first, very rough test, Sentence-BERT (https://github.com/UKPLab/sentence-transformers) also didn't perform well on a couple of test queries.
We should evaluate different models once we have the eval dataset from #4, and possibly fine-tune some of them on the Quora duplicate questions dataset (or even a small one created by the crowd).
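For reference, a minimal sketch of such a rough test, assuming the sentence-transformers package and cosine similarity as the comparison metric (the model name and the example sentences are illustrative, not the actual test queries):

```python
# Rough sanity check of Sentence-BERT embeddings on a toy query.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("bert-base-nli-mean-tokens")

queries = ["What are the symptoms of COVID-19?"]
candidates = [
    "Common symptoms include fever, cough and fatigue.",
    "The virus was first reported in December 2019.",
]

query_emb = model.encode(queries)    # shape: (1, dim)
cand_emb = model.encode(candidates)  # shape: (2, dim)

# A good embedding model should score the first candidate clearly higher.
print(cosine_similarity(query_emb, cand_emb))
```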
Some sentence embeddings to try:
- TF-IDF
- word2vec and fasttext (with pooling and SIF; see the sketch after this list)
- ELMo embeddings
- Flair embeddings
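A rough sketch of the pooling + SIF idea for the word-vector models, following the SIF weighting from Arora et al. (2017); the function and its inputs are hypothetical, with word vectors and unigram probabilities assumed to come from a pre-trained word2vec/fasttext model:

```python
import numpy as np

def sif_embeddings(sentences, word_vecs, word_probs, a=1e-3):
    """sentences: list of token lists; word_vecs: dict token -> vector;
    word_probs: dict token -> unigram probability; a: SIF weight parameter."""
    dim = len(next(iter(word_vecs.values())))
    embs = []
    for tokens in sentences:
        vecs = [word_vecs[t] for t in tokens if t in word_vecs]
        weights = [a / (a + word_probs.get(t, 0.0)) for t in tokens if t in word_vecs]
        # Weighted average: rare words get weights near 1, frequent ones near 0.
        embs.append(np.average(vecs, axis=0, weights=weights) if vecs else np.zeros(dim))
    embs = np.asarray(embs)
    # Remove the first principal component (the "common discourse" direction).
    u, _, _ = np.linalg.svd(embs.T @ embs)
    pc = u[:, :1]
    return embs - embs @ pc @ pc.T
```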
Added a basic eval dataset with #16
Added a basic eval script: https://github.com/deepset-ai/COVID-QA/pull/23. Results will be tracked via MLflow: https://public-mlflow.deepset.ai/#/experiments/55
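For context, logging a run to that MLflow server could look roughly like this (experiment name, parameters and the metric value are placeholders, not what the eval script actually logs):

```python
import mlflow

mlflow.set_tracking_uri("https://public-mlflow.deepset.ai/")
mlflow.set_experiment("covid-qa-embeddings")  # hypothetical experiment name

with mlflow.start_run(run_name="tfidf-baseline"):
    mlflow.log_param("model", "tfidf")
    mlflow.log_param("pooling", "none")
    mlflow.log_metric("roc_auc", 0.91)  # placeholder value
```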
Now running evaluation for a few plain BERT models as a baseline...
Overall, out-of-the-box and with mean, max and min pooling strategies (see the pooling sketch below):
- glove, fasttext and flair perform poorly
- elmo is somewhat better, but very slow because its embeddings are generated dynamically at inference time
With fine-tuning on domain data we might see some improvements.
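For illustration, the mean/max/min pooling strategies can be run via flair's DocumentPoolEmbeddings (glove here stands in for any of the word embeddings above; the sentence is a toy example):

```python
from flair.data import Sentence
from flair.embeddings import WordEmbeddings, DocumentPoolEmbeddings

glove = WordEmbeddings("glove")

# Compare the three pooling strategies on the same toy sentence.
for pooling in ("mean", "max", "min"):
    doc_emb = DocumentPoolEmbeddings([glove], pooling=pooling)
    sentence = Sentence("How does the coronavirus spread?")
    doc_emb.embed(sentence)
    print(pooling, sentence.embedding.shape)
```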
For fun, just added simple BLEU scoring to the results: https://public-mlflow.deepset.ai/#/experiments/55/runs/be5705cb1ddb4326a10f262732f5bd96 (BLEU is an n-gram-based (1-4 grams) comparison between strings; it can be done at sentence level, but with add-one counting and a smoothing factor for the brevity penalty to account for big length differences and potential non-matches).
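A sketch of such sentence-level BLEU scoring with NLTK; method2 smoothing (add-1 on the n-gram counts) is an assumption standing in for the exact smoothing used in the linked run:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "common symptoms include fever and dry cough".split()
hypothesis = "symptoms include fever and cough".split()

# method2 adds 1 to numerator and denominator of each n-gram precision,
# so higher-order n-grams with no matches don't zero out the whole score.
smooth = SmoothingFunction().method2
score = sentence_bleu([reference], hypothesis,
                      weights=(0.25, 0.25, 0.25, 0.25),  # uniform 1-4 grams
                      smoothing_function=smooth)
print(f"BLEU: {score:.3f}")
```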
Added evaluations for some pretrained transformer models in #45.
- Plain bert-base + sentence-bert (no fine-tuning done yet)
- Extracting embeddings from the last or second-to-last layer; simple pooling methods (see the sketch below)
- Many of them are worse than TF-IDF (kinda expected)
- sentence-bert with mean pooling from the second-to-last layer works best (roc_auc = 0.944, mean_abs_diff = 0.365, mlflow)
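For reference, a sketch of that best setup, i.e. mean pooling over the second-to-last layer (the model name is a placeholder; the actual runs in #45 used sentence-bert and other checkpoints):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

inputs = tokenizer(["How does the coronavirus spread?"], return_tensors="pt",
                   padding=True, truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

# hidden_states is a tuple of (num_layers + 1) tensors, embeddings first;
# index -2 is the second-to-last transformer layer.
second_last = outputs.hidden_states[-2]

# Mean pooling over real tokens only (padding is masked out).
mask = inputs["attention_mask"].unsqueeze(-1)
sentence_emb = (second_last * mask).sum(1) / mask.sum(1)
print(sentence_emb.shape)  # (batch_size, hidden_size)
```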