
find an industry dataset to showcase evaluation metrics

Open mrm1001 opened this issue 3 months ago • 7 comments

Goal: Showcase the Haystack evaluation metrics on a dataset that is not a toy dataset (e.g. Wikipedia articles) but that reflects use cases closer to what our users are working on. The goal of this task is to find an existing "benchmark" dataset (already published somewhere) that comes from an industry use case (e.g. legal, manuals, corporate FAQs).

Note: We have the earnings call dataset but we decided against it because we would prefer not to release a new dataset.

Edit 09/04: Could we try to find multiple candidate datasets, to be used in a tutorial and in a separate blog article?

mrm1001 avatar Mar 28 '24 13:03 mrm1001

I've gathered a few annotated datasets that could be a fit for this.

davidsbatista avatar Apr 09 '24 11:04 davidsbatista

After catching up with @vblagoje:

  • pick a minimum of 2 datasets that look interesting
  • check their license to make sure we can host a processed version somewhere
  • do the processing needed to add them to a Haystack RAG pipeline easily, as documents (see the sketch after this list)
  • add the processed version of these datasets somewhere that is easy to download (for the tutorial)
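For the last two bullets, a minimal sketch of what the processing could look like, assuming Haystack 2.x (`haystack-ai`) and the `datasets` library; the repo id and column names below are placeholders until we settle on the actual datasets:

```python
from datasets import load_dataset
from haystack import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Placeholder repo id and column names; to be replaced once we pick the datasets.
dataset = load_dataset("org/some-industry-qa-dataset", split="train")

# One Document per context/passage; keep question and answer in the metadata
# so the same records can also serve as ground truth during evaluation.
docs = [
    Document(content=row["context"], meta={"question": row["question"], "answer": row["answer"]})
    for row in dataset
]

document_store = InMemoryDocumentStore()
document_store.write_documents(docs)
print(f"Indexed {document_store.count_documents()} documents")
```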

mrm1001 avatar Apr 09 '24 15:04 mrm1001

@mrm1001 here are the two datasets:

  • PubMedQA_instruction

    • doesn't need any pre-processing and can be used directly
    • it has a permissive MIT licence
  • Law-StackExchange

    • needs flattening as a pre-processing step and can then be uploaded to our repo (rough sketch after this list)
    • it has a permissive CC BY-SA 4.0 licence
    • here is the notebook to flatten the dataset and upload it where needed. Here is the flat version
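For reference, the flattening is roughly along these lines (a rough sketch only: the source repo id and field names below are assumptions, the linked notebook is the authoritative version):

```python
from datasets import Dataset, load_dataset

# Illustrative only: repo id and field names are assumptions about the source dataset.
raw = load_dataset("<source-repo>/Law-StackExchange", split="train")

flat_rows = []
for row in raw:
    # Assumed nested schema: one question per row, with a list of answer dicts.
    for answer in row.get("answers", []):
        flat_rows.append(
            {
                "question_title": row.get("question_title"),
                "question_body": row.get("question_body"),
                "answer_body": answer.get("body"),
                "answer_score": answer.get("score"),
            }
        )

flat = Dataset.from_list(flat_rows)
# flat.push_to_hub("...")  # upload the flat version to our repo
```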

I'll be on the lookout for more datasets so we can use them in tutorials/social

vblagoje avatar Apr 12 '24 12:04 vblagoje

Hi Vladimir, could we split both datasets into 2:

  • one has the deduplicated set of documents ready to be loaded into a RAG pipeline
  • the other one has the set of question/context/answer triples (like now); see the sketch below for what I mean
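Something like this is what I have in mind (rough sketch with the `datasets` library, using PubMedQA_instruction as the example and assuming its columns are `instruction`/`context`/`response`; the target repo names are still to be decided):

```python
from datasets import Dataset, load_dataset

qa = load_dataset("vblagoje/PubMedQA_instruction", split="train")
df = qa.to_pandas()

# 1) deduplicated documents, ready to be loaded into a RAG pipeline
docs = Dataset.from_pandas(df[["context"]].drop_duplicates(), preserve_index=False)

# 2) question/context/answer triples (like now), to be used as ground truth
triples = Dataset.from_pandas(df[["instruction", "context", "response"]], preserve_index=False)

# docs.push_to_hub("...")     # repo names to be decided
# triples.push_to_hub("...")
```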

mrm1001 avatar Apr 12 '24 12:04 mrm1001

Here is the notebook that dedups PubMedQA_instruction and here is the deduped dataset version

Here is flat deduped Law-StackExchange

LMK if there is anything else to be done

vblagoje avatar Apr 12 '24 14:04 vblagoje

Thoughts about how to use these datasets for evaluation:

PubMed dataset

  • https://huggingface.co/datasets/vblagoje/PubMedQA_instruction
    • it has a question ("instruction"), a context ("paragraph containing the right answer"), and a response ("right answer")
    • it is unique by question (but multiple questions may be answered by the same context)
    • How can this dataset be used to show evaluation? (rough sketch after this list)
      • use the contexts as source docs
      • use LLM-based evaluation + SAS to compare output answer with the actual answer
      • use doc retrieval metrics to show retrieval (MAP, MRR, recall)
    • something to bear in mind: this dataset is very large, so it might make sense to downsample it in the tutorial and select rows with shorter responses (some of them can be very long).
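A rough sketch of what that evaluation could look like, assuming the Haystack 2.x evaluator components (`SASEvaluator`, `DocumentMAPEvaluator`, `DocumentMRREvaluator`, `DocumentRecallEvaluator` from `haystack-ai`); the RAG pipeline itself is elided, and the retrieved docs / predicted answers below are stand-ins for its output:

```python
from haystack import Document
from haystack.components.evaluators import (
    DocumentMAPEvaluator,
    DocumentMRREvaluator,
    DocumentRecallEvaluator,
    SASEvaluator,
)

# Ground truth taken from the dataset: gold context + gold answer per question.
gold_docs = [[Document(content="Paragraph containing the right answer ...")]]
gold_answers = ["Yes, the treatment reduced mortality because ..."]

# Stand-ins for the RAG pipeline output (retrieved docs + generated answer).
retrieved_docs = [[Document(content="Paragraph containing the right answer ...")]]
predicted_answers = ["The treatment did reduce mortality, since ..."]

# Retrieval metrics against the gold context.
map_score = DocumentMAPEvaluator().run(ground_truth_documents=gold_docs, retrieved_documents=retrieved_docs)
mrr_score = DocumentMRREvaluator().run(ground_truth_documents=gold_docs, retrieved_documents=retrieved_docs)
recall_score = DocumentRecallEvaluator().run(ground_truth_documents=gold_docs, retrieved_documents=retrieved_docs)

# Semantic answer similarity between generated and gold answers.
sas = SASEvaluator(model="sentence-transformers/all-MiniLM-L6-v2")
sas.warm_up()
sas_score = sas.run(ground_truth_answers=gold_answers, predicted_answers=predicted_answers)

print(map_score["score"], mrr_score["score"], recall_score["score"], sas_score["score"])
```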

Stack Exchange dataset

  • https://huggingface.co/datasets/vblagoje/Law-StackExchange-Deduplicated
    • this one comes from a legal forum
    • the titles are usually questions
    • How can this dataset be used to show evaluation?
      • use the "answers" column as source docs/contexts and the titles as questions, so we know which doc is the best one for each question. We could use this to showcase retriever metrics (MAP, MRR, recall).
      • Otherwise I would say we will need to manually create some questions (different from the titles) and then show how to use LLM-based metrics when there are no ground-truth answers (rough sketch after this list).
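For the no-ground-truth case, a rough sketch with the LLM-based evaluators (assuming `FaithfulnessEvaluator` and `ContextRelevanceEvaluator` from `haystack-ai` 2.x, which call an OpenAI model by default and need `OPENAI_API_KEY` set; the questions/contexts/answers below are stand-ins):

```python
from haystack.components.evaluators import ContextRelevanceEvaluator, FaithfulnessEvaluator

# Manually written questions (different from the titles), plus whatever the RAG
# pipeline retrieved and generated for them -- stand-in values below.
questions = ["Can a landlord keep the deposit if the tenant leaves early?"]
contexts = [["Answer post from the forum that was retrieved for this question ..."]]
predicted_answers = ["Generally the landlord may only keep the deposit if ..."]

# Is the generated answer grounded in the retrieved contexts?
faithfulness = FaithfulnessEvaluator().run(
    questions=questions, contexts=contexts, predicted_answers=predicted_answers
)

# Are the retrieved contexts relevant to the question at all?
relevance = ContextRelevanceEvaluator().run(questions=questions, contexts=contexts)

print(faithfulness["score"], relevance["score"])
```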

mrm1001 avatar Apr 24 '24 14:04 mrm1001

Also found by @vblagoje:

AllenAI extractive QA dataset

  • https://huggingface.co/datasets/allenai/ropes
  • How can this dataset be used to show evaluation?
    • we can use the "situation" and "question" together as the query (it would be a very long query); the "background" is then the doc you need to answer that query, and a final answer is provided. Even if it's extractive QA, I think we can still show off our new retriever metrics, LLM-based metrics, and also extractive-QA metrics (recall). See the sketch below.
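A small sketch of how the ROPES rows could be mapped onto that setup (assuming the dataset's `background`, `situation`, `question` and `answers` fields; concatenating situation and question is just one option):

```python
from datasets import load_dataset
from haystack import Document

ropes = load_dataset("allenai/ropes", split="train")

queries, gold_docs, gold_answers = [], [], []
for row in ropes:
    # Situation + question together form the (long) query.
    queries.append(f'{row["situation"]} {row["question"]}')
    # The background paragraph is the document needed to answer it.
    gold_docs.append([Document(content=row["background"])])
    # Extractive gold answer provided by the dataset.
    gold_answers.append(row["answers"]["text"][0])
```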

Australian Legal QA

  • https://huggingface.co/datasets/umarbutler/open-australian-legal-qa?row=0
  • How can this dataset be used to show evaluation?
    • I think the "prompt" column would need to be processed to extract the snippet, which can then be used as the docs. You then have both a question and a long-form answer that you can use as ground-truth answers. See the sketch below.
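A very rough sketch of that processing (the `extract_snippet` helper is a placeholder until we inspect how the "prompt" column is actually formatted, and the `question`/`answer` column names are assumptions based on the description above):

```python
from datasets import load_dataset
from haystack import Document

legal_qa = load_dataset("umarbutler/open-australian-legal-qa", split="train")

def extract_snippet(prompt: str) -> str:
    # Placeholder: pull the quoted legal snippet out of the prompt text.
    # The real implementation depends on how the prompt column is formatted.
    return prompt

docs, questions, gold_answers = [], [], []
for row in legal_qa:
    docs.append(Document(content=extract_snippet(row["prompt"])))
    questions.append(row["question"])
    gold_answers.append(row["answer"])  # long-form ground-truth answer
```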

mrm1001 avatar Apr 25 '24 15:04 mrm1001