COVID-QA
Benchmark & improve different embedding models
Plain transformer models (like BERT) are known to produce poor sentence embeddings. In a first, very rough test, Sentence-BERT (https://github.com/UKPLab/sentence-transformers) also didn't perform well on a couple of test queries.
We should evaluate different models once we have the eval dataset from #4, and possibly fine-tune some of them on the Quora duplicate questions dataset (or even a small one created by the crowd).
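For reference, a minimal sketch of such a rough test, assuming the sentence-transformers package and cosine similarity as the comparison metric (the model name and the example sentences are illustrative, not the actual test queries):

```python
# Rough sanity check of Sentence-BERT embeddings on a toy query.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("bert-base-nli-mean-tokens")

queries = ["What are the symptoms of COVID-19?"]
candidates = [
    "Common symptoms include fever, cough and fatigue.",
    "The virus was first reported in December 2019.",
]

query_emb = model.encode(queries)    # shape: (1, dim)
cand_emb = model.encode(candidates)  # shape: (2, dim)

# A good embedding model should score the first candidate clearly higher.
print(cosine_similarity(query_emb, cand_emb))
```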
Some sentence embeddings to try:
- TF-IDF
- word2vec and fasttext (with pooling and SIF; see the sketch after this list)
- ELMo embeddings
- Flair embeddings
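A rough sketch of the pooling + SIF idea for the word-vector models, following the SIF weighting from Arora et al. (2017); the function and its inputs are hypothetical, with word vectors and unigram probabilities assumed to come from a pre-trained word2vec/fasttext model:

```python
import numpy as np

def sif_embeddings(sentences, word_vecs, word_probs, a=1e-3):
    """sentences: list of token lists; word_vecs: dict token -> vector;
    word_probs: dict token -> unigram probability; a: SIF weight parameter."""
    dim = len(next(iter(word_vecs.values())))
    embs = []
    for tokens in sentences:
        vecs = [word_vecs[t] for t in tokens if t in word_vecs]
        weights = [a / (a + word_probs.get(t, 0.0)) for t in tokens if t in word_vecs]
        # Weighted average: rare words get weights near 1, frequent ones near 0.
        embs.append(np.average(vecs, axis=0, weights=weights) if vecs else np.zeros(dim))
    embs = np.asarray(embs)
    # Remove the first principal component (the "common discourse" direction).
    u, _, _ = np.linalg.svd(embs.T @ embs)
    pc = u[:, :1]
    return embs - embs @ pc @ pc.T
```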
Added a basic eval dataset with #16
Added a basic eval script: https://github.com/deepset-ai/COVID-QA/pull/23. Results will be tracked via MLflow: https://public-mlflow.deepset.ai/#/experiments/55
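For context, logging a run to that MLflow server could look roughly like this (experiment name, parameters and the metric value are placeholders, not what the eval script actually logs):

```python
import mlflow

mlflow.set_tracking_uri("https://public-mlflow.deepset.ai/")
mlflow.set_experiment("covid-qa-embeddings")  # hypothetical experiment name

with mlflow.start_run(run_name="tfidf-baseline"):
    mlflow.log_param("model", "tfidf")
    mlflow.log_param("pooling", "none")
    mlflow.log_metric("roc_auc", 0.91)  # placeholder value
```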
Now running evaluation for a few plain BERT models as a baseline...
Overall, out-of-the-box and with mean, max and min pooling strategies (see the pooling sketch below):
- glove, fasttext and flair perform poorly
- elmo is somewhat better, but very slow because its embeddings are generated dynamically at inference time
With fine-tuning on domain data we might see some improvements.
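For illustration, the mean/max/min pooling strategies can be run via flair's DocumentPoolEmbeddings (glove here stands in for any of the word embeddings above; the sentence is a toy example):

```python
from flair.data import Sentence
from flair.embeddings import WordEmbeddings, DocumentPoolEmbeddings

glove = WordEmbeddings("glove")

# Compare the three pooling strategies on the same toy sentence.
for pooling in ("mean", "max", "min"):
    doc_emb = DocumentPoolEmbeddings([glove], pooling=pooling)
    sentence = Sentence("How does the coronavirus spread?")
    doc_emb.embed(sentence)
    print(pooling, sentence.embedding.shape)
```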
For fun, just added simple BLEU scoring to the results: https://public-mlflow.deepset.ai/#/experiments/55/runs/be5705cb1ddb4326a10f262732f5bd96 (BLEU is an n-gram-based (1-4 grams) comparison between strings; it can be done at sentence level, but with add-one counting and a smoothing factor for the brevity penalty to account for big length differences and potential non-matches).
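A sketch of such sentence-level BLEU scoring with NLTK; method2 smoothing (add-1 on the n-gram counts) is an assumption standing in for the exact smoothing used in the linked run:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "common symptoms include fever and dry cough".split()
hypothesis = "symptoms include fever and cough".split()

# method2 adds 1 to numerator and denominator of each n-gram precision,
# so higher-order n-grams with no matches don't zero out the whole score.
smooth = SmoothingFunction().method2
score = sentence_bleu([reference], hypothesis,
                      weights=(0.25, 0.25, 0.25, 0.25),  # uniform 1-4 grams
                      smoothing_function=smooth)
print(f"BLEU: {score:.3f}")
```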
Added evaluations for some pretrained transformer models in #45.
- Plain bert-base + sentence-bert (no fine-tuning done yet)
- Extracting embeddings from the last or second-to-last layer; simple pooling methods (see the sketch below)
- Many of them are worse than TF-IDF (kinda expected)
- sentence-bert with mean pooling from the second-to-last layer works best (roc_auc = 0.944, mean_abs_diff = 0.365, mlflow)
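For reference, a sketch of that best setup, i.e. mean pooling over the second-to-last layer (the model name is a placeholder; the actual runs in #45 used sentence-bert and other checkpoints):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

inputs = tokenizer(["How does the coronavirus spread?"], return_tensors="pt",
                   padding=True, truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

# hidden_states is a tuple of (num_layers + 1) tensors, embeddings first;
# index -2 is the second-to-last transformer layer.
second_last = outputs.hidden_states[-2]

# Mean pooling over real tokens only (padding is masked out).
mask = inputs["attention_mask"].unsqueeze(-1)
sentence_emb = (second_last * mask).sum(1) / mask.sum(1)
print(sentence_emb.shape)  # (batch_size, hidden_size)
```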