find an industry dataset to showcase evaluation metrics
Goal: Showcase the Haystack evaluation metrics on a dataset that is not a toy dataset (e.g. Wikipedia articles) but that reflects use cases closer to what our users are working on. The task is to find an existing "benchmark" dataset (already published somewhere) that comes from an industry use case (e.g. legal, manuals, corporate FAQs).
Note: We have the earnings call dataset but we decided against it because we would prefer not to release a new dataset.
Edit 09/04: Could we try to find multiple candidate datasets, to be used in a tutorial and in a separate blog article?
I've gathered a few annotated datasets that are a good fit for this.
After catching up with @vblagoje :
- pick a minimum of 2 datasets that look interesting
- check their licenses to make sure we can host a processed version somewhere
- do the processing needed so they can easily be added to a Haystack RAG pipeline (as documents; see the sketch after this list)
- add the processed version of these datasets somewhere that is easy to download (for the tutorial)
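As a starting point for the "as documents" step, here is a minimal sketch of that processing, assuming Haystack 2.x and the `datasets` library, with the PubMedQA dataset mentioned below as the example; the column names (`context`, `instruction`) are my assumption about the processed schema:

```python
from datasets import load_dataset
from haystack import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Load one of the candidate datasets from the Hugging Face Hub.
dataset = load_dataset("vblagoje/PubMedQA_instruction", split="train")

# Wrap each context paragraph in a Haystack Document, keeping the question
# around in the metadata (column names are assumed, not confirmed).
docs = [
    Document(content=row["context"], meta={"question": row["instruction"]})
    for row in dataset
]

# Write the documents into a store so a RAG pipeline's retriever can use them.
document_store = InMemoryDocumentStore()
document_store.write_documents(docs)
```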
@mrm1001 here are the two datasets:
-
- doesn't need any pre-processing and can be used directly
- it has a permissive MIT license
I'll be on the lookout for more datasets so we can use them in tutorials/social
Hi Vladimir, could we split each of the datasets in two:
- one part has the deduplicated set of documents, ready to be loaded into a RAG pipeline
- the other has the set of question/context/answer triples (like now)
Here is the notebook that dedups PubMedQA_instruction, and here is the deduped dataset version.
Here is the flat, deduped Law-StackExchange.
LMK if there is anything else to be done.
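For reference, the dedup step boils down to something like the sketch below (not the actual notebook; the `context` column name is an assumption about the schema):

```python
from datasets import Dataset, load_dataset

dataset = load_dataset("vblagoje/PubMedQA_instruction", split="train")

# Drop rows that repeat the same context paragraph, keeping the first one.
df = dataset.to_pandas()
deduped = Dataset.from_pandas(df.drop_duplicates(subset="context"), preserve_index=False)
print(f"{len(dataset)} rows -> {len(deduped)} rows with unique contexts")
```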
Thoughts about how to use these datasets for evaluation:
PubMed dataset
- https://huggingface.co/datasets/vblagoje/PubMedQA_instruction
- it has question ("instruction"), context ("paragraph with the right answer") and response ("right answer")
- it is unique by question (but multiple questions might be answered by the same context)
- there are actually very few duplicates; they could almost be ignored
How can this dataset be used to show evaluation
- use the contexts as source docs
- use LLM-based evaluation + SAS to compare the generated answer with the reference answer
- use doc retrieval metrics (MAP, MRR, recall) to show retrieval quality
- something to bear in mind: this dataset is very large, so it might make sense to downsample it in the tutorial and select rows with shorter responses (some of them can be very long); see the sketch after this list
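A sketch of what that evaluation could look like, continuing from the deduped dataset above and assuming the Haystack 2.x evaluator components (`SASEvaluator`, `DocumentMAPEvaluator`, `DocumentMRREvaluator`, `DocumentRecallEvaluator`); `retrieved_docs` and `generated_answers` stand in for the actual RAG pipeline outputs:

```python
from haystack import Document
from haystack.components.evaluators import (
    DocumentMAPEvaluator,
    DocumentMRREvaluator,
    DocumentRecallEvaluator,
    SASEvaluator,
)

# Downsample: keep rows with short responses and take a small random slice.
sample = deduped.filter(lambda row: len(row["response"]) < 500)
sample = sample.shuffle(seed=42).select(range(100))

# Ground truth: for each question, the single context that answers it.
ground_truth_docs = [[Document(content=row["context"])] for row in sample]
retrieved_docs = ...  # per-question retriever output from the pipeline
generated_answers = ...  # per-question generator output from the pipeline

# Retrieval metrics: MAP, MRR, recall.
for evaluator in (DocumentMAPEvaluator(), DocumentMRREvaluator(), DocumentRecallEvaluator()):
    result = evaluator.run(
        ground_truth_documents=ground_truth_docs, retrieved_documents=retrieved_docs
    )
    print(type(evaluator).__name__, result["score"])

# Semantic Answer Similarity between generated and reference answers.
sas = SASEvaluator()
sas.warm_up()  # loads the underlying sentence-transformers model
sas_result = sas.run(
    ground_truth_answers=[row["response"] for row in sample],
    predicted_answers=generated_answers,
)
print("SAS", sas_result["score"])
```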
Stack Exchange dataset
- https://huggingface.co/datasets/vblagoje/Law-StackExchange-Deduplicated
- this one comes from a legal Q&A forum
- the titles are usually questions
How can this dataset be used to show evaluation
- use the "answers" column as source docs/contexts and the titles as questions, so we know the best doc for each question. We could use this to showcase the retriever metrics (MAP, MRR, recall); see the sketch after this list.
- Otherwise, we would need to manually create some questions (different from the titles) and then show how to use LLM-based metrics when there are no ground-truth answers.
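A sketch of the titles-as-queries idea, assuming Haystack 2.x's in-memory BM25 retriever; the column names (`title`, `answers`, with one answer string per row) are my assumption about the schema:

```python
from datasets import load_dataset
from haystack import Document
from haystack.components.evaluators import DocumentMRREvaluator
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

law = load_dataset("vblagoje/Law-StackExchange-Deduplicated", split="train")

# One document per answer, with the question title kept in the metadata.
docs = [Document(content=row["answers"], meta={"title": row["title"]}) for row in law]

store = InMemoryDocumentStore()
store.write_documents(docs)
retriever = InMemoryBM25Retriever(document_store=store, top_k=5)

# Use each title as the query; the answer it belongs to is the ground truth.
ground_truth, retrieved = [], []
for doc in docs[:100]:  # small slice keeps the tutorial fast
    ground_truth.append([doc])
    retrieved.append(retriever.run(query=doc.meta["title"])["documents"])

mrr = DocumentMRREvaluator().run(
    ground_truth_documents=ground_truth, retrieved_documents=retrieved
)
print("MRR", mrr["score"])
```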
Also found by @vblagoje:
AllenAI extractive QA dataset
- https://huggingface.co/datasets/allenai/ropes
How can this dataset be used to show evaluation
- we can use the "situation" plus the "question" as the query (it would be a very long query); the background is then the doc needed to answer that query, and a final answer is provided. Even though it's extractive QA, I think we can still show off our new retriever metrics, the LLM-based metrics, and also extractive QA metrics (recall).
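A sketch of how the ROPES fields could map onto queries, documents, and ground-truth answers; the field names follow the dataset card's SQuAD-style schema, but double-check them before relying on this:

```python
from datasets import load_dataset
from haystack import Document

ropes = load_dataset("allenai/ropes", split="validation")  # the test split has no answers

queries, ground_truth_docs, ground_truth_answers = [], [], []
for row in ropes:
    # The query is long: the situation concatenated with the question.
    queries.append(f"{row['situation']}\n{row['question']}")
    # The background passage is the document needed to answer the query.
    ground_truth_docs.append([Document(content=row["background"])])
    # ROPES answers are extractive spans; take the first annotated one.
    ground_truth_answers.append(row["answers"]["text"][0])
```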
Australian Legal QA
- https://huggingface.co/datasets/umarbutler/open-australian-legal-qa
How can this dataset be used to show evaluation
- I think the "prompt" column would need to be processed to extract the snippet; we can use the snippets as docs. Then we have both a question and a long-form answer to use as ground-truth answers.
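A hedged sketch of that processing; the `Snippet:` delimiter is purely hypothetical (as are the `question`/`answer` column names), so the real prompt format needs to be inspected first:

```python
import re

from datasets import load_dataset
from haystack import Document

qa = load_dataset("umarbutler/open-australian-legal-qa", split="train")

# Hypothetical pattern: assumes the snippet follows a "Snippet:" marker in
# the prompt; check the actual data for the real delimiter before using this.
SNIPPET_PATTERN = re.compile(r"Snippet:\s*(.+)", re.DOTALL)

docs = []
for row in qa:
    match = SNIPPET_PATTERN.search(row["prompt"])
    if match:
        docs.append(
            Document(
                content=match.group(1).strip(),
                meta={"question": row["question"], "answer": row["answer"]},
            )
        )
```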