how was the hotpot_qa dataset preprocessed?

Open DanielSchuhmacher opened this issue 1 year ago • 0 comments

I am curious how you created the list of documents (the corpus). The original hotpot_qa does not come with such a list. Instead, for each query it comes with only 10 documents: 2 documents containing the content for the gold answer and 8 distractor documents. My current assumption is the following: you took the distractor dataset and extracted the documents for all queries to build the corpus. The 2 gold documents from the original hotpot_qa were then marked as the relevant documents for each query.
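To make the assumption above concrete, here is a minimal sketch of the preprocessing I have in mind. It is only my guess, not necessarily what you actually did. The field names (`context.title`, `context.sentences`, `supporting_facts.title`) follow the distractor configuration of hotpot_qa as I understand it from the Hugging Face version; the function name `build_corpus_and_qrels`, the use of the title as document id, and the toy records are made up for illustration.

```python
def build_corpus_and_qrels(examples):
    """Pool the per-query documents into one corpus and mark gold documents.

    Hypothetical helper: assumes each example looks like a record from the
    hotpot_qa distractor configuration and that titles identify documents.
    """
    corpus = {}   # title -> document text (deduplicated across queries)
    queries = {}  # query id -> question text
    qrels = {}    # query id -> set of titles of the gold documents

    for ex in examples:
        queries[ex["id"]] = ex["question"]
        ctx = ex["context"]
        for title, sentences in zip(ctx["title"], ctx["sentences"]):
            # Keep the first version seen if a title appears for several queries.
            corpus.setdefault(title, "".join(sentences))
        # Supporting facts point at sentences; only the document titles matter here.
        qrels[ex["id"]] = set(ex["supporting_facts"]["title"])

    return corpus, queries, qrels


# Toy records (invented, much smaller than the real 10 documents per query).
examples = [
    {
        "id": "q1",
        "question": "Toy question one?",
        "context": {
            "title": ["A", "B", "C"],
            "sentences": [["A one. ", "A two."], ["B one."], ["C one."]],
        },
        "supporting_facts": {"title": ["A", "B"], "sent_id": [0, 0]},
    },
    {
        "id": "q2",
        "question": "Toy question two?",
        "context": {
            "title": ["B", "D"],
            "sentences": [["B one."], ["D one."]],
        },
        "supporting_facts": {"title": ["B", "D"], "sent_id": [0, 0]},
    },
]

corpus, queries, qrels = build_corpus_and_qrels(examples)
```

With this approach, document "B" is shared by both queries but ends up in the corpus only once, and the distractors of one query become part of the search space for every other query.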

Please let me know how it works, since this confuses me quite a lot. Thank you very much!

If my assumption is correct, you could also have a look at the multi-hop-rag dataset, which was specifically created in that format already (the corpus is separated from the queries and answers). The documents are also longer, which I think is a more realistic use case for a retrieval system, especially a RAG system.

Sep 02 '24 14:09 DanielSchuhmacher