Uni Freiburg QA Datasets Collection
The University of Freiburg's Algorithms and Data Structures Group has a nice repository with an overview of several QA datasets:
https://github.com/ad-freiburg/large-qa-datasets
Example entry: TriviaQA (Joshi et al.)
- PDF: https://www.aclweb.org/anthology/P17-1147.pdf
- Dataset: http://nlp.cs.washington.edu/triviaqa/
- Year of publication: 2017
- Size: ca. 95,000
- Data collection: Joshi et al. collect question-answer pairs from 14 trivia websites. Additionally, they gather textual evidence for the given answers from web search results and Wikipedia articles.
I think many of these could make great data sources for OpenAssistant.
I would be open to taking on the task of converting a couple of them into usable datasets.
I would make a notebook and a markdown file in the notebook/argumentation folder for each dataset and use the Open-Assistant data schema.
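A minimal sketch of the kind of conversion I have in mind, using TriviaQA from the Hugging Face hub (the output column names here are just placeholders, not necessarily the final Open-Assistant schema):

```python
# Sketch: convert TriviaQA into simple instruction/response pairs.
# Output field names are placeholders for the Open-Assistant data schema.
from datasets import load_dataset

ds = load_dataset("trivia_qa", "rc", split="train")

def to_pair(example):
    return {
        "INSTRUCTION": example["question"],
        "RESPONSE": example["answer"]["value"],
        "SOURCE": "trivia_qa",
    }

pairs = ds.map(to_pair, remove_columns=ds.column_names)
pairs.to_json("trivia_qa_pairs.jsonl")
```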
Yes - please do. To the extent these aren't already in P3/xP3, it would be good to get them into QA format. Also think about how to convert them into multi-part instruction -> answer, instruction -> answer paths. Please discuss any issues in Discord and coordinate with others working on QA -> assistant dialog.
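Roughly something like this, i.e. a short prompter/assistant path instead of a single QA pair (the structure below is just an illustration, not a fixed format):

```python
# Illustration of a multi-part instruction -> answer path built from one
# trivia question; roles and keys are hypothetical, not the final schema.
dialogue = [
    {"role": "prompter",  "text": "Which US state is nicknamed the 'Sunshine State'?"},
    {"role": "assistant", "text": "Florida."},
    {"role": "prompter",  "text": "Where does the nickname come from?"},
    {"role": "assistant", "text": "It refers to Florida's warm, sunny climate, and the nickname appears on the state's license plates."},
]
```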
Also, UnifiedQA already includes a lot of QA.
Great!
- I'll check out UnifiedQA (I'm assuming it's the one from this paper: https://arxiv.org/abs/2005.00700)
- Try to get multi-part instructions as often as possible
I'll timebox some time to work on this on the weekend.
Quick question though: what is P3/xP3? Just want to make sure I don't start working on the wrong dataset.
https://github.com/allenai/unifiedqa
https://huggingface.co/datasets/Muennighoff/P3
https://huggingface.co/datasets/bigscience/xP3
You can find out which datasets were used to create P3 and xP3 either in their paper or in the downloader itself: https://arxiv.org/pdf/2211.01786.pdf
If it's a pain to figure out the overlap, just convert the whole UnifiedQA dataset. If you have a script, that should be pretty straightforward. We can deal with dedup later.
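Something like the sketch below, assuming UnifiedQA's tab-separated "input \t answer" task files (the paths and output field names are placeholders):

```python
# Sketch of a bulk converter for UnifiedQA-style TSV task files.
import json

def convert_unifiedqa_tsv(in_path, out_path, source_name):
    with open(in_path, encoding="utf-8") as f_in, \
         open(out_path, "w", encoding="utf-8") as f_out:
        for line in f_in:
            parts = line.rstrip("\n").split("\t")
            if len(parts) < 2:
                continue  # skip malformed lines
            question, answer = parts[0], parts[1]
            f_out.write(json.dumps({
                "INSTRUCTION": question,  # placeholder field names
                "RESPONSE": answer,
                "SOURCE": source_name,
            }) + "\n")

# Hypothetical paths; one call per UnifiedQA sub-task:
convert_unifiedqa_tsv("natural_questions/train.tsv",
                      "natural_questions_train.jsonl",
                      "unifiedqa/natural_questions")
```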
Do you want to work on converting some of the questions to instructions, so we can vary the types of things asked of the bot? We can open up a new issue.
Yes, sure. I can work on expanding the notebook that I made to not just download and change the schema of the data, but also make it into more of an instruction dataset.
I am thinking of doing it with templates, plus using a model to rephrase my templates into as many variants as possible.
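A rough sketch of the template part (the idea is that a paraphrasing model would later multiply these few hand-written templates into many more variants):

```python
import random

# A few hand-written instruction templates; a paraphrasing model could
# expand this list into many more phrasings.
TEMPLATES = [
    "Answer the following trivia question: {question}",
    "Please answer this question as briefly as possible: {question}",
    "{question} Give only the answer, no explanation.",
]

def make_instruction(question):
    return random.choice(TEMPLATES).format(question=question)

print(make_instruction("Which element has the chemical symbol 'Fe'?"))
```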
I'll close this issue and make a new one.