find an industry dataset to showcase evaluation metrics
Goal: Showcase the Haystack evaluation metrics on a dataset that is not a toy dataset (e.g. Wikipedia articles) but that reflects use cases closer to what our users are working on. The task is to find an existing "benchmark" dataset (already published somewhere) that comes from an industry use case (e.g. legal, manuals, corporate FAQs).
Note: We have the earnings call dataset but we decided against it because we would prefer not to release a new dataset.
Edit 09/04: Could we try to find multiple candidate datasets, to be used in a tutorial and in a separate blog article?
I've gathered a few annotated datasets that are a good fit for this.
After catching up with @vblagoje :
- pick a minimum of 2 datasets that look interesting
- check their licenses to make sure we can host a processed version somewhere
- do the processing needed so they can easily be added to a Haystack RAG pipeline (as documents; see the sketch after this list)
- add the processed version of these datasets somewhere that is easy to download (for the tutorial)
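As a starting point for the "as documents" step, here is a minimal sketch of that processing, assuming Haystack 2.x and the `datasets` library, with the PubMedQA dataset mentioned below as the example; the column names (`context`, `instruction`) are my assumption about the processed schema:

```python
from datasets import load_dataset
from haystack import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Load one of the candidate datasets from the Hugging Face Hub.
dataset = load_dataset("vblagoje/PubMedQA_instruction", split="train")

# Wrap each context paragraph in a Haystack Document, keeping the question
# around in the metadata (column names are assumed, not confirmed).
docs = [
    Document(content=row["context"], meta={"question": row["instruction"]})
    for row in dataset
]

# Write the documents into a store so a RAG pipeline's retriever can use them.
document_store = InMemoryDocumentStore()
document_store.write_documents(docs)
```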
@mrm1001 here are the two datasets:
-
- doesn't need any pre-processing and can be used directly
- it has a permissive MIT license
I'll be on the lookout for more datasets so we can use them in tutorials/social
Hi Vladimir, could we split each of the datasets in two:
- one part has the deduplicated set of documents, ready to be loaded into a RAG pipeline
- the other has the set of question/context/answer triples (like now)
Here is the notebook that dedups PubMedQA_instruction, and here is the deduped dataset version.
Here is the flat, deduped Law-StackExchange.
LMK if there is anything else to be done.
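For reference, the dedup step boils down to something like the sketch below (not the actual notebook; the `context` column name is an assumption about the schema):

```python
from datasets import Dataset, load_dataset

dataset = load_dataset("vblagoje/PubMedQA_instruction", split="train")

# Drop rows that repeat the same context paragraph, keeping the first one.
df = dataset.to_pandas()
deduped = Dataset.from_pandas(df.drop_duplicates(subset="context"), preserve_index=False)
print(f"{len(dataset)} rows -> {len(deduped)} rows with unique contexts")
```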
Thoughts about how to use these datasets for evaluation:
PubMed dataset
- https://huggingface.co/datasets/vblagoje/PubMedQA_instruction
- it has question ("instruction"), context ("paragraph with the right answer") and response ("right answer")
- it is unique by question (but multiple questions might be answered by the same context)
- there are actually very few duplicates; they could almost be ignored
How can this dataset be used to show evaluation
- use the contexts as source docs
- use LLM-based evaluation + SAS to compare the generated answer with the reference answer
- use doc retrieval metrics (MAP, MRR, recall) to show retrieval quality
- something to bear in mind: this dataset is very large, so it might make sense to downsample it in the tutorial and select rows with shorter responses (some of them can be very long); see the sketch after this list
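A sketch of what that evaluation could look like, continuing from the deduped dataset above and assuming the Haystack 2.x evaluator components (`SASEvaluator`, `DocumentMAPEvaluator`, `DocumentMRREvaluator`, `DocumentRecallEvaluator`); `retrieved_docs` and `generated_answers` stand in for the actual RAG pipeline outputs:

```python
from haystack import Document
from haystack.components.evaluators import (
    DocumentMAPEvaluator,
    DocumentMRREvaluator,
    DocumentRecallEvaluator,
    SASEvaluator,
)

# Downsample: keep rows with short responses and take a small random slice.
sample = deduped.filter(lambda row: len(row["response"]) < 500)
sample = sample.shuffle(seed=42).select(range(100))

# Ground truth: for each question, the single context that answers it.
ground_truth_docs = [[Document(content=row["context"])] for row in sample]
retrieved_docs = ...  # per-question retriever output from the pipeline
generated_answers = ...  # per-question generator output from the pipeline

# Retrieval metrics: MAP, MRR, recall.
for evaluator in (DocumentMAPEvaluator(), DocumentMRREvaluator(), DocumentRecallEvaluator()):
    result = evaluator.run(
        ground_truth_documents=ground_truth_docs, retrieved_documents=retrieved_docs
    )
    print(type(evaluator).__name__, result["score"])

# Semantic Answer Similarity between generated and reference answers.
sas = SASEvaluator()
sas.warm_up()  # loads the underlying sentence-transformers model
sas_result = sas.run(
    ground_truth_answers=[row["response"] for row in sample],
    predicted_answers=generated_answers,
)
print("SAS", sas_result["score"])
```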
Stack Exchange dataset
- https://huggingface.co/datasets/vblagoje/Law-StackExchange-Deduplicated
- this one comes from a legal Q&A forum
- the titles are usually questions
How can this dataset be used to show evaluation
- use the "answers" column as source docs/contexts and the titles as questions, so we know the best doc for each question. We could use this to showcase the retriever metrics (MAP, MRR, recall); see the sketch after this list.
- Otherwise, we would need to manually create some questions (different from the titles) and then show how to use LLM-based metrics when there are no ground-truth answers.
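A sketch of the titles-as-queries idea, assuming Haystack 2.x's in-memory BM25 retriever; the column names (`title`, `answers`, with one answer string per row) are my assumption about the schema:

```python
from datasets import load_dataset
from haystack import Document
from haystack.components.evaluators import DocumentMRREvaluator
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

law = load_dataset("vblagoje/Law-StackExchange-Deduplicated", split="train")

# One document per answer, with the question title kept in the metadata.
docs = [Document(content=row["answers"], meta={"title": row["title"]}) for row in law]

store = InMemoryDocumentStore()
store.write_documents(docs)
retriever = InMemoryBM25Retriever(document_store=store, top_k=5)

# Use each title as the query; the answer it belongs to is the ground truth.
ground_truth, retrieved = [], []
for doc in docs[:100]:  # small slice keeps the tutorial fast
    ground_truth.append([doc])
    retrieved.append(retriever.run(query=doc.meta["title"])["documents"])

mrr = DocumentMRREvaluator().run(
    ground_truth_documents=ground_truth, retrieved_documents=retrieved
)
print("MRR", mrr["score"])
```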
Also found by @vblagoje:
AllenAI extractive QA dataset
- https://huggingface.co/datasets/allenai/ropes
How can this dataset be used to show evaluation
- we can use the "situation" plus the "question" as the query (it would be a very long query); the background is then the doc needed to answer that query, and a final answer is provided. Even though it's extractive QA, I think we can still show off our new retriever metrics, the LLM-based metrics, and also extractive QA metrics (recall).
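A sketch of how the ROPES fields could map onto queries, documents, and ground-truth answers; the field names follow the dataset card's SQuAD-style schema, but double-check them before relying on this:

```python
from datasets import load_dataset
from haystack import Document

ropes = load_dataset("allenai/ropes", split="validation")  # the test split has no answers

queries, ground_truth_docs, ground_truth_answers = [], [], []
for row in ropes:
    # The query is long: the situation concatenated with the question.
    queries.append(f"{row['situation']}\n{row['question']}")
    # The background passage is the document needed to answer the query.
    ground_truth_docs.append([Document(content=row["background"])])
    # ROPES answers are extractive spans; take the first annotated one.
    ground_truth_answers.append(row["answers"]["text"][0])
```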
Australian Legal QA
- https://huggingface.co/datasets/umarbutler/open-australian-legal-qa
How can this dataset be used to show evaluation
- I think the "prompt" column would need to be processed to extract the snippet; we can use the snippets as docs. Then we have both a question and a long-form answer to use as ground-truth answers.
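A hedged sketch of that processing; the `Snippet:` delimiter is purely hypothetical (as are the `question`/`answer` column names), so the real prompt format needs to be inspected first:

```python
import re

from datasets import load_dataset
from haystack import Document

qa = load_dataset("umarbutler/open-australian-legal-qa", split="train")

# Hypothetical pattern: assumes the snippet follows a "Snippet:" marker in
# the prompt; check the actual data for the real delimiter before using this.
SNIPPET_PATTERN = re.compile(r"Snippet:\s*(.+)", re.DOTALL)

docs = []
for row in qa:
    match = SNIPPET_PATTERN.search(row["prompt"])
    if match:
        docs.append(
            Document(
                content=match.group(1).strip(),
                meta={"question": row["question"], "answer": row["answer"]},
            )
        )
```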