Open-Assistant
Create synthetic QA dataset (~1k samples)
Please create a synthetic QA dialog-dataset with the procedure that you proposed.
We want to present them for ranking in two ways: 1.) ranking different user questions (initial prompts), and 2.) ranking multiple assistant answers for a given prompt.
Please try to sample multiple completions for a single question/prompt that we can use for 2.
Please store the dataset as JSON (e.g. storing multiple completions in a list member of the prompt object).
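To make the requested layout concrete, here is a minimal sketch of one possible shape for the file. The field names `prompt` and `completions`, the helper names, and the file name `synthetic_qa.json` are assumptions for illustration only; nothing in this issue fixes a schema, and the sample texts are placeholders rather than generated data.

```python
import json
import os
import tempfile


def make_prompt_record(prompt, completions):
    """Build one prompt object; all sampled completions live in a list member."""
    # Hypothetical field names: "prompt" and "completions" are not prescribed by the issue.
    return {"prompt": prompt, "completions": list(completions)}


def save_dataset(records, path):
    """Write the list of prompt objects to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)


def load_dataset(path):
    """Read the list of prompt objects back from a JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Placeholder content, only to show the structure.
records = [
    make_prompt_record(
        "What is the capital of France?",
        [
            "Paris.",
            "The capital of France is Paris.",
            "Do you mean the current capital or a historical one?",
        ],
    ),
]

# Round trip through a temporary directory so the sketch leaves no stray files.
path = os.path.join(tempfile.mkdtemp(), "synthetic_qa.json")
save_dataset(records, path)
loaded = load_dataset(path)
```

With this shape, the list of prompt objects can be used for ranking prompts (way 1), and the `completions` list inside each object can be used for ranking answers to a given prompt (way 2).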
I'm new to using GitHub. I have attached the notebook that generates the questions, the correct answers, and the closed-book answers using T5. I am working on ways to generate longer, higher-quality answers from multiple models; generating high-quality answers takes more compute and larger generative models. I have about 18k unique generated questions covering about 2,700 unique topics. A pickle file containing a dictionary of those questions, with a self-explanatory data structure, is attached along with the code. I will continue working on this tomorrow to get the level 1 answers to the questions, including requests for more information or clarification, after which the data may be ready for a test finetune or human preference rankings.
flan_xxl_gpu0_question-answer.zip
Hey @Rallio67 thank you very much!
I suggest you look up a few short tutorials and "how-to"s about GitHub, especially how to use forks and pull requests to contribute to open-source repositories. It's very easy and super useful!
After that, it would be super cool to have your code as a pull request in our repository. We don't need the actual data, as long as you can give us the code and clear instructions on how to execute it. Do you think that's possible?
Yes I can do that.
Thank you for all your awesome work @Rallio67 !!