
Benchmark & improve different embedding models

Open tholor opened this issue 4 years ago • 6 comments

Plain transformer models (like BERT) are known to produce poor sentence embeddings. In a first, very rough test, Sentence-BERT (https://github.com/UKPLab/sentence-transformers) also didn't perform particularly well on a couple of test queries.

We should evaluate different models once we have the eval dataset from #4 and possibly fine-tune some on the Quora duplicate questions dataset (or even a small one created by the crowd). A rough sketch of what such fine-tuning could look like is below.
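
A minimal sketch of pair-based fine-tuning with the sentence-transformers library; the checkpoint name and the toy question pairs are placeholders, not our actual data or setup:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Placeholder checkpoint; any Sentence-BERT model could be swapped in here.
model = SentenceTransformer("bert-base-nli-mean-tokens")

# Toy examples for illustration; in practice these would come from the
# Quora duplicate questions dataset (label 1.0 = duplicate, 0.0 = not).
train_examples = [
    InputExample(texts=["How does COVID-19 spread?",
                        "What are the transmission routes of the coronavirus?"], label=1.0),
    InputExample(texts=["How does COVID-19 spread?",
                        "What is the capital of France?"], label=0.0),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# Cosine similarity loss pulls duplicate pairs together, pushes others apart.
train_loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=1, warmup_steps=100)
```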

tholor avatar Mar 20 '20 17:03 tholor

Some sentence embeddings to try:

  • TF-IDF
  • word2vec and fastText (with pooling and SIF; see the sketch after this list)
  • ELMo embeddings
  • Flair embeddings
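
For the SIF variant, here is a rough sketch of the weighted-average-plus-PC-removal scheme from Arora et al. (2017); `word_vectors` and `word_probs` are assumed to be precomputed (e.g. fastText vectors and unigram frequencies over the corpus):

```python
import numpy as np

def sif_embeddings(sentences, word_vectors, word_probs, dim=300, a=1e-3):
    embs = np.zeros((len(sentences), dim))
    for i, sent in enumerate(sentences):
        tokens = [t for t in sent.lower().split() if t in word_vectors]
        if not tokens:
            continue
        # Weight each word vector by a / (a + p(word)), then average.
        weights = np.array([a / (a + word_probs.get(t, 0.0)) for t in tokens])
        vectors = np.array([word_vectors[t] for t in tokens])
        embs[i] = (weights[:, None] * vectors).mean(axis=0)
    # Remove the first principal component (the common "discourse" direction).
    u, _, _ = np.linalg.svd(embs.T @ embs)
    pc = u[:, 0]
    return embs - embs @ np.outer(pc, pc)
```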

andra-pumnea avatar Mar 21 '20 09:03 andra-pumnea

Added a basic eval dataset in #16

tholor avatar Mar 21 '20 09:03 tholor

Added a basic eval script: https://github.com/deepset-ai/COVID-QA/pull/23. Results will be tracked via MLflow: https://public-mlflow.deepset.ai/#/experiments/55
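
For reference, logging to that tracking server looks roughly like this; the experiment, run, and metric names here are illustrative, not the actual script's:

```python
import mlflow

mlflow.set_tracking_uri("https://public-mlflow.deepset.ai/")
mlflow.set_experiment("COVID-QA embedding eval")  # hypothetical experiment name

with mlflow.start_run(run_name="bert-base-uncased_mean-pooling"):
    mlflow.log_param("model", "bert-base-uncased")
    mlflow.log_param("pooling", "mean")
    mlflow.log_metric("roc_auc", 0.91)  # placeholder value
```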

Now running evaluations for a few plain BERT models as a baseline...

tholor avatar Mar 21 '20 15:03 tholor

Overall, out of the box and with mean, max, and min pooling strategies:

  • GloVe, fastText, and Flair embeddings perform poorly
  • ELMo is somewhat better, but very slow because the embeddings are generated dynamically.

[figure: ROC curves for the evaluated embedding models]

With fine-tuning on domain data we might see some improvements.

andra-pumnea avatar Mar 21 '20 17:03 andra-pumnea

For fun, I just added simple BLEU scoring to the results: https://public-mlflow.deepset.ai/#/experiments/55/runs/be5705cb1ddb4326a10f262732f5bd96. BLEU is an n-gram-based (1-4 grams) comparison between strings. It can be computed at sentence level, but then needs add-one counting and a smoothing factor for the brevity penalty, to account for large length differences and potential zero-matches.
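
A small sketch of smoothed sentence-level BLEU with NLTK; the example sentences are made up, and `method1` is just one of several available smoothing functions:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Smoothing keeps short sentences with no higher-order n-gram matches
# from collapsing to a score of 0.
reference = "social distancing reduces the spread of the virus".split()
candidate = "social distancing slows the spread of the virus".split()

score = sentence_bleu([reference], candidate,
                      weights=(0.25, 0.25, 0.25, 0.25),  # uniform 1-4 gram weights
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```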

stedomedo avatar Mar 21 '20 22:03 stedomedo

Added evaluations for some pretrained transformer models in #45.

  • Plain bert-base + sentence-bert (no fine-tuning done yet)
  • Extracting embeddings from the last or second-to-last layer; simple pooling methods (see the sketch after this list)
  • Many of them are worse than TF-IDF (somewhat expected)
  • sentence-bert with mean pooling over the second-to-last layer works best (roc_auc = 0.944, mean_abs_diff = 0.365, mlflow)
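
For illustration, this is roughly what mean pooling over the second-to-last hidden layer looks like with the transformers library; the checkpoint name and sample queries are placeholders (the runs above used sentence-bert checkpoints as well as plain bert-base):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden_states = model(**batch).hidden_states  # embeddings + one tensor per layer
    layer = hidden_states[-2]                         # second-to-last layer
    mask = batch["attention_mask"].unsqueeze(-1)      # zero out padding tokens
    return (layer * mask).sum(dim=1) / mask.sum(dim=1)  # masked mean pooling

vecs = embed(["How does the coronavirus spread?",
              "What are common COVID-19 symptoms?"])
```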

tholor avatar Mar 22 '20 12:03 tholor