FlagEmbedding
[C-MTEB] How to convert QA dataset to Retrieval & Reranking Dataset
I noticed that some datasets, such as CmedqaRetrieval, CMedQAv1, and CMedQAv2, are built from QA datasets and converted to the 'query-pos-neg' format. Do you have any instructions for building this data?
QA dataset sample:
Instruct:
Output:
Reranking dataset sample:
Query:
Pos:
Neg:
Retrieval dataset sample:
Query:
Context:
Id:
For QA datasets, we use the query as `query` and the answer/context as `pos`. We use the candidates (except the ground truth) provided by the original dataset as `neg`.
If there are no candidates for your dataset, you can find some candidates via an embedding model to construct `neg`.
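As a rough illustration of the conversion described above, here is a minimal sketch, not the maintainers' actual script. The helper names (`toy_embed`, `cosine`, `build_record`), the `num_neg` parameter, and storing `pos` as a list are all assumptions made for the example. `toy_embed` is a bag-of-words stand-in so the snippet runs without dependencies; in practice you would replace it with a real embedding model.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in for a real embedding model: bag-of-words term counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_record(question, answer, candidates=None, corpus=None,
                 embed=toy_embed, num_neg=2):
    """Return one {"query", "pos", "neg"} record for a QA pair."""
    if candidates:
        # Use the dataset's own candidates, excluding the ground truth.
        neg = [c for c in candidates if c != answer]
    else:
        # No candidates: rank other answers in the corpus by similarity
        # to the query and keep the closest ones as negatives.
        q_vec = embed(question)
        pool = [c for c in (corpus or []) if c != answer]
        pool.sort(key=lambda c: cosine(q_vec, embed(c)), reverse=True)
        neg = pool[:num_neg]
    return {"query": question, "pos": [answer], "neg": neg}
```

Each record can then be written as one JSON line per query. Note that negatives mined this way may include passages that actually answer the question, so some manual spot-checking is likely worthwhile.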
Thanks for answering. I have one more question: is there a way to filter out complex questions (tricky, subtext-heavy questions whose answers are usually not directly related to the question)?
One possible method is to use GPT to filter these questions. Using the cosine similarity between questions and answers is simpler, but the threshold is difficult to set.
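The cosine-similarity option could look roughly like the sketch below. It assumes you have already embedded each question and answer with a model of your choice and only shows the thresholding step. The function names and the `(question_vec, answer_vec, item)` tuple layout are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two dense vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def split_by_similarity(pairs, threshold):
    """Split QA pairs into (kept, flagged) by question-answer similarity.

    pairs: iterable of (question_vec, answer_vec, item) tuples, where the
    vectors are precomputed embeddings and item is the original QA record.
    Pairs scoring below the threshold are flagged as possibly "tricky".
    """
    kept, flagged = [], []
    for q_vec, a_vec, item in pairs:
        score = cosine_similarity(q_vec, a_vec)
        (kept if score >= threshold else flagged).append((score, item))
    return kept, flagged
```

Since the threshold is hard to choose up front, one option is to print the sorted scores for a sample of pairs and pick a cutoff by inspecting where the low-scoring pairs begin to look indirect.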