
Tutorial 09: Update to EmbeddingRetriever Training

Open bglearning opened this issue 3 years ago • 14 comments

Overview

With #2887, we replaced DPR with EmbeddingRetriever in Tutorial 06.

Now, we might want to do the same for Tutorial 09, which covers training (or fine-tuning) a DPR retriever model.

Q1. Should we go ahead with this switch? Is there any reason why keeping DPR might be better?

Alternatively, we could create one tutorial for each. I guess it depends on which approach we want to demonstrate and on what we think would be most valuable for users.

Training EmbeddingRetriever

Only the sentence-transformers variant of `EmbeddingRetriever` (i.e. `model_format="sentence_transformers"`) can be trained.

Its `train` method does some data setup and then calls the `fit` method of `SentenceTransformer` (from the `sentence_transformers` package).

Input data format is:

```python
[
    {"question": ..., "pos_doc": ..., "neg_doc": ..., "score": ...},
    ...
]
```
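
For reference, here is a minimal sketch of how fine-tuning might be wired up with this format. The model name, data values, and score are illustrative assumptions, not taken from the tutorial:

```python
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import EmbeddingRetriever

# Sentence-transformers variant of EmbeddingRetriever (the only trainable one)
document_store = InMemoryDocumentStore(embedding_dim=768)
retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/msmarco-distilbert-base-tas-b",  # assumed base model
    model_format="sentence_transformers",
)

# Training data in the format above; `score` is the margin target
# (e.g. cross_encoder(q, pos) - cross_encoder(q, neg)) used by MarginMSELoss
training_data = [
    {
        "question": "What is the capital of France?",
        "pos_doc": "Paris is the capital and largest city of France.",
        "neg_doc": "Berlin is the capital of Germany.",
        "score": 0.9,  # illustrative cross-encoder margin, not a real value
    },
]

retriever.train(training_data)  # internally delegates to SentenceTransformer.fit()
```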

It uses `MarginMSELoss` (as part of the GPL procedure).
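
Roughly, the equivalent direct `sentence_transformers` call would look like the sketch below; the base model and hyperparameters are assumptions, and the example only shows the shape of what `fit` receives:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("msmarco-distilbert-base-tas-b")  # assumed base model

# Each example carries (query, positive, negative) plus the margin label
train_examples = [
    InputExample(
        texts=[
            "What is the capital of France?",
            "Paris is the capital and largest city of France.",
            "Berlin is the capital of Germany.",
        ],
        label=0.9,  # illustrative cross-encoder margin
    ),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MarginMSELoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
```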

Q2. If we were to demonstrate its training, which dataset would be best to use? GPL and related work seem to use MS MARCO, but then we would need cross-encoder scores for the `score` field above, right? So there doesn't seem to be a ready, download-and-use dataset available?
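
If no pre-scored dataset exists, the `score` field could in principle be generated with a cross-encoder. A minimal sketch, where the cross-encoder model is an assumption (chosen because it is MS MARCO-trained):

```python
from sentence_transformers import CrossEncoder

# Hypothetical: compute the margin `score` for one (question, pos_doc, neg_doc) triple
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

question = "What is the capital of France?"
pos_doc = "Paris is the capital and largest city of France."
neg_doc = "Berlin is the capital of Germany."

pos_score, neg_score = cross_encoder.predict([(question, pos_doc), (question, neg_doc)])
score = float(pos_score - neg_score)  # margin target for MarginMSELoss
```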

RFC: @brandenchan @vblagoje @agnieszka-m (please loop in anyone else if necessary) cc: @mkkuemmel

bglearning · Aug 15 '22 13:08