Open-Assistant Create synthetic QA dataset (~1k samples)

Open andreaskoepf opened this issue 2 years ago • 4 comments

Please create a synthetic QA dialog-dataset with the procedure that you proposed.

We want to present the samples for ranking in two ways: 1.) ranking different user questions (initial prompts), and 2.) ranking multiple assistant answers for a given prompt.

Please try to sample multiple completions for a single question/prompt that we can use for 2.

Please store the dataset as JSON (e.g. storing multiple completions in a list member of the prompt object).
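A minimal sketch of what such a JSON layout could look like, with several completions stored in a list on each prompt object. The field names `prompt` and `completions`, the helper names, and `toy_generate` (a stand-in for a real sampling model) are all assumptions made for illustration; they are not taken from the notebook discussed in this issue.

```python
import json
import random
from typing import Callable, Dict, List


def sample_completions(prompt: str, generate: Callable[[str], str], n: int = 4) -> List[str]:
    """Call `generate` up to n times and keep distinct, non-empty answers in first-seen order."""
    seen: List[str] = []
    for _ in range(n):
        answer = generate(prompt).strip()
        if answer and answer not in seen:
            seen.append(answer)
    return seen


def build_dataset(prompts: List[str], generate: Callable[[str], str], n: int = 4) -> List[Dict]:
    """Build one object per prompt; completions live in a list member of that object."""
    return [
        {"prompt": p, "completions": sample_completions(p, generate, n)}
        for p in prompts
    ]


# Hypothetical stand-in for a sampling language model (assumption, not a real API).
_rng = random.Random(0)
_TEMPLATES = [
    "Short answer to: {q}",
    "A longer, more detailed answer to: {q}",
    "Could you clarify what you mean by: {q}",
]


def toy_generate(prompt: str) -> str:
    return _rng.choice(_TEMPLATES).format(q=prompt)


dataset = build_dataset(["What is a synthetic dataset?"], toy_generate, n=5)
print(json.dumps(dataset, indent=2))
```

Keeping the completions in a plain list means the same file can serve both ranking tasks: the `prompt` fields alone for ranking initial prompts, and each `completions` list for ranking answers to a given prompt.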

Dec 28 '22 00:12 andreaskoepf

I'm new to using GitHub. I have attached the notebook for generating the questions, the correct answers, and the closed-book answers using T5. I am working on ways to generate longer, higher-quality answers from multiple models; generating high-quality answers takes more compute and larger generative models. So far I have about 18k unique generated questions covering about 2,700 unique topics. A pickle file containing a dictionary of those questions, with a self-explanatory data structure, is attached along with the code. I will continue working on this tomorrow to get the level 1 answers to the questions, including requests for more information or clarification, after which the data may be ready for a test finetune or for human preference rankings. flan_xxl_gpu0_question-answer.zip

Dec 29 '22 04:12 Rallio67

Hey @Rallio67 thank you very much!

I suggest you look up a few short tutorials and "how-to"s about GitHub, especially how to use forks and pull requests to contribute to open-source repositories. It's very easy and super useful!

After that, it would be super cool to have your code as a pull request in our repository. We don't need the actual data, as long as you can give us the code and clear instructions on how to execute it. Do you think that's possible?

Dec 29 '22 14:12 yk

Yes I can do that.

Dec 31 '22 00:12 Rallio67

Thank you for all your awesome work @Rallio67 !!

Jan 05 '23 18:01 huu4ontocord
