Open-Assistant
Creating augmented data using few-shot prompts for explanations of jokes, logical inferences, etc.
See https://www.lesswrong.com/posts/EHbJ69JDs4suovpLw/testing-palm-prompts-on-gpt3.
Try doing 2-, 3-, or 4-shot inference on something like JT, NeoX 20B, or Galactica.
After we find a promising model and configuration, we can scrape the net for jokes and paragraphs with logical inferences to create dialog data.
Human: Tell me a joke about {extract keywords from joke}
Assistant: {joke}
Human: Explain the joke.
Assistant: {explanation}
See also https://storage.googleapis.com/pathways-language-model/PaLM-paper.pdf
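A minimal sketch (Python, HuggingFace `transformers`) of what a first few-shot "explain the joke" test could look like; the model name, the example joke, and the explanation text below are placeholders, not part of the proposal:

```python
# Sketch: 1-shot "tell a joke / explain the joke" prompting with a causal LM.
# Model choice and the demonstration joke/explanation are placeholders.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "EleutherAI/gpt-neox-20b"  # any causal LM works for a first test
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

demonstration = (
    "Human: Tell me a joke about chickens.\n"
    "Assistant: Why did the chicken cross the road? To get to the other side.\n"
    "Human: Explain the joke.\n"
    "Assistant: The punchline is funny because it answers a setup that sounds "
    "like a riddle with a deliberately literal, mundane reply.\n\n"
)
prompt = demonstration + (
    "Human: Tell me a joke about programmers.\n"
    "Assistant:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
# Print only the newly generated continuation
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```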
Adding explanations at the end of existing instruction dataset answers where the answers are classifications (see P3, Natural Instructions, etc.):
For example,
This is a movie review for the movie {movie}: {review}. This movie review is {classification} because ... [your created answer]
This is a movie review for the movie {movie}: {review}. This movie review is {classification} because ... [your created answer]
This is a movie review for the movie {movie}: {review}. This movie review is {classification} because ... [generated answer]
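As a rough illustration, a few-shot prompt in that layout could be assembled like this; the movie names, reviews, and hand-written explanations are made-up placeholders:

```python
# Sketch: two reviews with hand-written explanations, then a third left open
# for the model to complete with a generated explanation.
def review_block(movie, review, label, explanation=""):
    return (
        f"This is a movie review for the movie {movie}: {review}. "
        f"This movie review is {label} because {explanation}"
    )

prompt = "\n\n".join([
    review_block("Movie A", "A gripping, beautifully shot story.", "positive",
                 "the reviewer praises both the plot and the cinematography."),
    review_block("Movie B", "Two hours I will never get back.", "negative",
                 "the reviewer regrets the time spent watching it."),
    review_block("Movie C", "An instant classic with a stellar cast.", "positive"),
])
# `prompt` now ends with "... is positive because " and can be sent to the model,
# whose continuation becomes the [generated answer] above.
```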
We can also follow this up with explanations for other "hard" things like:
explaining riddles, poems (metaphors), analogies, and songs.
Going with the movie reviews idea, could we use the Rotten Tomatoes dataset to generate prompts, maybe supplemented with one of the models fine-tuned on it as well?
https://huggingface.co/datasets/rotten_tomatoes
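For reference, a minimal sketch of turning that dataset into explanation prompts with the `datasets` library; note the Hub version only has `text` and `label` columns (no movie titles), so the `{movie}` slot from the template above is dropped here:

```python
# Sketch: build "... because " explanation prompts from the Rotten Tomatoes dataset.
from datasets import load_dataset

ds = load_dataset("rotten_tomatoes", split="train")
label_names = ds.features["label"].names  # ["neg", "pos"]

def to_prompt(example):
    label = "positive" if label_names[example["label"]] == "pos" else "negative"
    return {
        "prompt": (
            f"This is a movie review: {example['text']}. "
            f"This movie review is {label} because "
        )
    }

prompts = ds.map(to_prompt)
print(prompts[0]["prompt"])
```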
The idea is to create a dataset with explanations. Like for example take the movie dataset and do this:
This is a movie review for the movie {movie}: {review}. This movie review is {classification} because ... [your created answer]
Am I right?
I'm interested in picking this up. How large should the dataset be?
@momegas yes. if it is very compute intensive, it doesn't need to be large. maybe see if you can get it to work first. And then we can discuss size. we can run it on some extra compute.
Sounds like a very cool task and I would love to give it a try if it is still relevant :) @ontocord
@ontocord I'd like to have a try, can you tell me your name in Discord? Maybe we can talk a little bit more there. @mikegarts Maybe we can work on it together? More data is better for this project. My name in Discord is QiKo
@kkie02 Sure, I'm in discord as mikegarts. Feel free to ping me. Btw I just opened a pr with somewhat relevant instruction dataset https://github.com/LAION-AI/Open-Assistant/pull/2209 but would love to cooperate on further work.
Going to work on this area, but with more specific tasks (semantics, logic, reasoning): https://github.com/LAION-AI/Open-Assistant/issues/3122
Closing old data issue.