Open-Assistant
Add instruction to reverse augmentation
We currently support reverse augmentation for the alpaca datasets. So far this has not proved very helpful. As mentioned in section 5.1.1 of the paper, we should probably add an explicit instruction to the question part, e.g. "Please generate a question based on the answer:".
Hi, I am new to this project and would like to contribute towards it. This seems like something I could pick up.
Could you please explain how I should get started?
I guess I have to change something over here, i.e. remove reverse_augmentation. Do you have a reference that could help me get started with generating the additional instructions you mentioned in the original issue?
I am also new to this project, but here is how I would have started:
I believe you can start by adding a new task type
https://github.com/LAION-AI/Open-Assistant/blob/7be14b7e9c6a2e85b8c003728d7cc1126fc9f8d7/oasst-shared/oasst_shared/schemas/protocol.py#L11
and then adding support for it on the front and backend.
Edit:
I did some more digging in the code, and supporting reverse augmentation might require a major overhaul of the data structure used to store conversation trees. Currently messages are stored in a "rooted" tree (every message has one parent and many children), but we would need an "unrooted" tree where a message can have multiple parents.
This would have to be changed to a many-to-many relationship: https://github.com/LAION-AI/Open-Assistant/blob/50b55d881c1a92c0fa234ec0e2623e55d7b42b59/backend/oasst_backend/models/message.py#L34
@DominiqueGarmier we would just add this for the datasets that already support the reverse_augmentation keyword for now. Maybe if it yields good results, we can add it in the DatasetEntry class to support it on all datasets. So maybe we could start with the alpaca loader.
I wouldn't add another class here, but just add a couple of possible instructions ["I give you an answer and you find the corresponding question", "Please generate a question based on the answer", "Let's play Jeopardy."] and sample one of those. We do something similar already with e.g. here
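To make the suggestion above concrete, here is a minimal sketch of sampling one instruction and swapping the roles of question and answer. The instruction strings are the ones quoted above; the helper name reverse_augment, its signature, and the newline used to join instruction and answer are assumptions for illustration and are not taken from the project's loader code.

```python
import random
from typing import Optional, Tuple

# Candidate instructions quoted from the comment above.
REVERSE_INSTRUCTIONS = [
    "I give you an answer and you find the corresponding question",
    "Please generate a question based on the answer",
    "Let's play Jeopardy.",
]


def reverse_augment(
    question: str, answer: str, rng: Optional[random.Random] = None
) -> Tuple[str, str]:
    """Return a (prompt, target) pair with question and answer swapped.

    Hypothetical helper: the original answer becomes the prompt, prefixed
    with one randomly sampled instruction, and the original question
    becomes the target the model should produce.
    """
    rng = rng or random.Random()
    instruction = rng.choice(REVERSE_INSTRUCTIONS)
    prompt = f"{instruction}\n{answer}"  # joining with a newline is an assumption
    return prompt, question
```

Passing a seeded random.Random makes the sampled instruction reproducible, which is convenient in tests.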
@ambujpawar Would you like to work on this too? If so, we have to figure out who takes this issue. Sorry guys, I somehow missed this issue.
But I am sure, we'll find some other stuff where both of you can help, e.g. https://github.com/LAION-AI/Open-Assistant/issues/2827
OK, with the merge of #2870 some things changed here. Sorry for this, but things move quickly here. We need to implement this for the get_formatted method of the DatasetEntry class. So maybe adding a test here and checking if questions and answers are reversed would be a good start.
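A rough sketch of what such a test could look like. ToyEntry below is a hypothetical stand-in, not the real DatasetEntry: its fields, the reverse flag, and the list-of-strings return value are all assumptions made for illustration, and the actual get_formatted signature should be checked in the code after #2870.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyEntry:
    """Hypothetical stand-in for DatasetEntry; the real class differs."""

    questions: List[str]
    answers: List[str]

    def get_formatted(self, reverse: bool = False) -> List[str]:
        # Interleave the turns pairwise; with reverse=True the answer of
        # each pair comes first and its question second.
        if reverse:
            first, second = self.answers, self.questions
        else:
            first, second = self.questions, self.answers
        turns: List[str] = []
        for a, b in zip(first, second):
            turns.extend([a, b])
        return turns


def test_reverse_swaps_questions_and_answers() -> None:
    entry = ToyEntry(questions=["Q1", "Q2"], answers=["A1", "A2"])
    # Default order: each question is followed by its answer.
    assert entry.get_formatted() == ["Q1", "A1", "Q2", "A2"]
    # Reversed order: each answer is followed by its question.
    assert entry.get_formatted(reverse=True) == ["A1", "Q1", "A2", "Q2"]
```

The same two assertions could be pointed at the real class once its constructor and formatting arguments are known.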
I am closing this for now, since reverse augmentation was doing more harm than good.