
Uni Freiburg QA Datasets Collection

Open vale95ntino opened this issue 2 years ago • 2 comments

The University of Freiburg's Algorithms and Data Structures Group has a nice repository with an overview of several QA datasets:

https://github.com/ad-freiburg/large-qa-datasets

Example entry: TriviaQA (Joshi et al.)

  • PDF: https://www.aclweb.org/anthology/P17-1147.pdf
  • Dataset: http://nlp.cs.washington.edu/triviaqa/
  • Year of Publication: 2017
  • Size: ca. 95,000
  • Data Collection: Joshi et al. collect question-answer pairs from 14 trivia websites. Additionally, they gather textual evidence for the given answers from web search results and Wikipedia articles.

I think many of these could make great data sources for OpenAssistant.

I would be open to taking on the task of converting a couple of them to usable datasets.

I would make a notebook and markdown file in the notebook/argumentation folder for each dataset and use the Open-Assistant Data Scheme.
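
To make this concrete, here is a rough sketch of the kind of conversion I mean, using TriviaQA as the example. The INSTRUCTION/RESPONSE/SOURCE/METADATA field names are just my guess at the schema; I'll double-check against the actual data scheme docs before submitting anything.

```python
import json

from datasets import load_dataset  # pip install datasets

# "rc.nocontext" keeps only the question/answer pairs, without the evidence documents.
trivia = load_dataset("trivia_qa", "rc.nocontext", split="train")

with open("triviaqa_oa.jsonl", "w") as f:
    for example in trivia:
        record = {
            "INSTRUCTION": example["question"],
            "RESPONSE": example["answer"]["value"],
            "SOURCE": "TriviaQA (Joshi et al., 2017)",
            "METADATA": json.dumps({"aliases": example["answer"]["aliases"]}),
        }
        f.write(json.dumps(record) + "\n")
```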

vale95ntino avatar Jan 09 '23 03:01 vale95ntino

Yes - please do. To the extent these aren't already in P3/xP3, it would be good to get them into QA format. Also think about how to convert them into multi-part instruction -> answer, instruction -> answer paths. Please discuss any issues in Discord and coordinate with others doing QA -> assistant dialog conversion.

huu4ontocord avatar Jan 09 '23 04:01 huu4ontocord

Also, UnifiedQA has a lot of QA already included.

huu4ontocord avatar Jan 09 '23 04:01 huu4ontocord

Great!

  • I'll check out UnifiedQA (I'm assuming it's the one from this paper: https://arxiv.org/abs/2005.00700)
  • Try to get multi-part instructions as often as possible (rough sketch of what I mean below)
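
For the multi-part paths, I'm picturing something like this: one QA item expanded into a short dialog, with the evidence text used as a follow-up turn. The role/field names here are placeholders, not the final schema.

```python
# Placeholder sketch: expand one QA item into a multi-part dialog path.
# The "role"/"text" keys are illustrative, not the final Open-Assistant schema.
def qa_to_dialog(question, answer, evidence=None):
    turns = [
        {"role": "prompter", "text": question},
        {"role": "assistant", "text": answer},
    ]
    if evidence:
        turns += [
            {"role": "prompter", "text": "Can you point me to a source that supports that answer?"},
            {"role": "assistant", "text": evidence},
        ]
    return turns
```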

I'll timebox some time to work on this over the weekend.

Quick q though: what are p3/xp3? Just want to make sure I don't start working on the wrong dataset.

vale95ntino avatar Jan 10 '23 20:01 vale95ntino

https://github.com/allenai/unifiedqa https://huggingface.co/datasets/Muennighoff/P3 https://huggingface.co/datasets/bigscience/xP3

You can find out which datasets were used to create P3 and xP3 either in their paper or in the downloader itself: https://arxiv.org/pdf/2211.01786.pdf

If it's a pain to figure out the overlap, just convert the whole UnifiedQA dataset. If you have a script, that should be pretty straightforward. We can deal with dedup later.
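
Something along these lines should be enough for a first pass. This assumes the one-question-TAB-answer-per-line TSV layout described in the allenai/unifiedqa README; the paths and field names are placeholders.

```python
import csv
import json
from pathlib import Path

def convert_tsv(tsv_path, out_file):
    """Append one JSONL record per question/answer pair in a UnifiedQA TSV file."""
    count = 0
    with tsv_path.open(newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) < 2:
                continue  # skip malformed lines instead of failing the whole file
            out_file.write(json.dumps({
                "INSTRUCTION": row[0].strip(),
                "RESPONSE": row[1].strip(),
                "SOURCE": f"UnifiedQA/{tsv_path.parent.name}",
            }) + "\n")
            count += 1
    return count

if __name__ == "__main__":
    with open("unifiedqa_oa.jsonl", "w") as out:
        for tsv in sorted(Path("unifiedqa_data").rglob("*.tsv")):
            print(tsv, convert_tsv(tsv, out))
```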

huu4ontocord avatar Jan 12 '23 06:01 huu4ontocord

Do you want to work on converting some of the questions to instructions, so we can vary the types of things asked of the bot? We can open up a new issue.

huu4ontocord avatar Jan 14 '23 08:01 huu4ontocord

Yes, sure. I can work on expanding the notebook I made so that it not only downloads the data and converts it to the schema, but also turns it into more of an instruction dataset.

I am thinking of doing it with templates, plus using a model to rephrase my templates into as many variants as possible.
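
Roughly what I have in mind is below. The templates are just illustrative placeholders; the model-generated paraphrases would extend this list.

```python
import random

# Illustrative placeholder templates; a paraphrasing model would be used
# to generate many more variants of these.
INSTRUCTION_TEMPLATES = [
    "Answer the following trivia question: {question}",
    "Please answer this question as concisely as you can: {question}",
    "I have a quiz question for you. {question}",
    "{question} Give a short answer and, if possible, a one-sentence justification.",
]

def to_instruction(question, rng):
    """Wrap a raw question in a randomly chosen instruction-style template."""
    return rng.choice(INSTRUCTION_TEMPLATES).format(question=question)

rng = random.Random(42)  # fixed seed so the dataset build stays reproducible
print(to_instruction("Which country hosted the 1998 FIFA World Cup?", rng))
```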

I'll close this issue and make a new one.

vale95ntino avatar Jan 14 '23 17:01 vale95ntino