[C-MTEB] How to convert QA dataset to Retrieval & Reranking Dataset

Open Iambestfeed opened this issue 11 months ago • 3 comments

I noticed that some datasets, such as CmedqaRetrieval, CMedQAv1, and CMedQAv2, are built from QA datasets and converted to a 'query-pos-neg' format. Do you have any instructions for building this data?

QA dataset sample:

Instruct:
Output:

Reranking dataset sample:

Query:
Pos:
Neg:

Retrieval dataset sample:

Query:
Context:
Id:

Iambestfeed avatar Feb 28 '24 05:02 Iambestfeed

For QA datasets, we use query as query, and use answer/context as pos. We use the candidate (except ground truth) provided by the original dataset as neg.
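
As a rough illustration of the mapping described above, here is a minimal sketch. The input field names (`question`, `answer`, `candidates`) are hypothetical and will differ per dataset; only the `query`/`pos`/`neg` output keys come from this thread.

```python
import json

def qa_to_query_pos_neg(record):
    """Map one QA record to a query/pos/neg dict.

    Assumes hypothetical input fields: 'question', 'answer', and
    'candidates' (candidate answers shipped with the original dataset).
    """
    query = record["question"]
    pos = [record["answer"]]
    # Every provided candidate except the ground truth becomes a negative.
    neg = [c for c in record.get("candidates", []) if c not in pos]
    return {"query": query, "pos": pos, "neg": neg}

# Made-up record, for illustration only.
sample = {
    "question": "What helps a mild headache?",
    "answer": "Rest and drink water.",
    "candidates": [
        "Rest and drink water.",
        "Apply ice to a sprained ankle.",
        "Take vitamin C for a cold.",
    ],
}

row = qa_to_query_pos_neg(sample)
print(json.dumps(row, ensure_ascii=False))
```

Writing one such JSON object per line gives a file with one training example per row.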

If there are no candidates for your datasets, you can find some candidates via an embedding model to construct neg.
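
A sketch of that idea, assuming you rank the corpus by similarity to the query and take the top non-positive passages as negatives. `toy_embed` is only a bag-of-words stand-in so the example runs without dependencies; in practice you would pass your own embedding model as `embed`. All function names here are hypothetical.

```python
import math
from collections import Counter

def toy_embed(text):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def mine_negatives(query, positives, corpus, embed=toy_embed, top_k=2):
    """Return the top_k passages most similar to the query that are not positives."""
    q_vec = embed(query)
    candidates = [doc for doc in corpus if doc not in positives]
    ranked = sorted(candidates, key=lambda d: cosine(q_vec, embed(d)), reverse=True)
    return ranked[:top_k]

# Made-up corpus, for illustration only.
corpus = [
    "aspirin relieves mild headache pain",
    "drink water and rest for a headache",
    "the stock market closed higher today",
    "headache can be caused by dehydration",
]
query = "what helps a headache"
positives = ["aspirin relieves mild headache pain"]

negs = mine_negatives(query, positives, corpus, top_k=2)
print({"query": query, "pos": positives, "neg": negs})
```

Note that top-ranked passages can be unlabeled correct answers (false negatives), so it may be worth inspecting a sample of the mined negatives before training.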

staoxiao avatar Feb 28 '24 12:02 staoxiao

> For QA datasets, we use query as `query`, and use answer/context as `pos`. We use the candidate (except ground truth) provided by the original dataset as `neg`.
>
> If there are no candidates for your datasets, you can find some candidates via an embedding model to construct `neg`.

Thanks for answering. Is there a way to filter out complex questions (tricky, subtextual questions whose answers are usually not directly related to the question)?

Iambestfeed avatar Feb 28 '24 12:02 Iambestfeed

One possible method is to use GPT to filter these questions. Using the cosine similarity between questions and answers is simpler, but the threshold is difficult to set.
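
A minimal sketch of the cosine-similarity option, assuming question and answer embeddings have already been computed with some model. The vectors below are hand-made toy values, and the function name and the `0.5` threshold are placeholders, not recommended settings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def split_by_similarity(pairs, q_vecs, a_vecs, threshold=0.5):
    """Keep pairs whose question/answer similarity reaches the threshold."""
    kept, dropped = [], []
    for pair, q, a in zip(pairs, q_vecs, a_vecs):
        score = cosine(q, a)
        (kept if score >= threshold else dropped).append((pair, round(score, 3)))
    return kept, dropped

# Toy data: the second answer points in a very different direction
# from its question, so it falls below the threshold.
pairs = [("q1", "a1"), ("q2", "a2")]
q_vecs = [[1.0, 0.0], [1.0, 0.0]]
a_vecs = [[0.9, 0.1], [0.1, 0.9]]

kept, dropped = split_by_similarity(pairs, q_vecs, a_vecs, threshold=0.5)
print("kept:", kept)
print("dropped:", dropped)
```

Because the right threshold is hard to know in advance, one option is to sort all pairs by score and read a sample around several candidate cut-offs before fixing a value.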

staoxiao avatar Feb 29 '24 09:02 staoxiao