Add instruction to reverse augmentation

Open CloseChoice opened this issue 1 year ago • 4 comments

We currently support reverse augmentation for the alpaca datasets. So far this has not proved very helpful. As mentioned in section 5.1.1 of the paper, we should probably add an explicit instruction to the question part, e.g. "Please generate a question based on the answer:".

CloseChoice avatar Apr 16 '23 06:04 CloseChoice

Hi, I am new to this project and would like to contribute towards it. This seems like something I could pick up.

Could you please explain how I should start working on it?

I guess I have to change something over here, i.e. removing reverse_augmentation. Do you have a reference that could help me get started with generating the additional instructions you mentioned in the original issue?

ambujpawar avatar Apr 16 '23 10:04 ambujpawar

I am also new to this project, but here is how I would have started:

I believe you can start by adding a new task type

https://github.com/LAION-AI/Open-Assistant/blob/7be14b7e9c6a2e85b8c003728d7cc1126fc9f8d7/oasst-shared/oasst_shared/schemas/protocol.py#L11

and then adding support for it on the front and backend.

Edit:

I did some more digging in the code, and supporting reverse augmentation might require a major overhaul of the data structure used to store conversation trees. Currently messages are stored in a "rooted" tree (every message has one parent and many children), but we would need an "unrooted" tree where a message can have multiple parents.

This would have to be changed to a many-to-many relationship: https://github.com/LAION-AI/Open-Assistant/blob/50b55d881c1a92c0fa234ec0e2623e55d7b42b59/backend/oasst_backend/models/message.py#L34

dominiquegarmier avatar Apr 16 '23 12:04 dominiquegarmier

@DominiqueGarmier we would just add this for the datasets that already support the reverse_augmentation keyword for now. Maybe if it yields good results, we can add it in the DatasetEntry class to support it on all datasets. So maybe we could start with the alpaca loader.

I wouldn't add another class here, but just add a couple of possible instructions ["I give you an answer and you find the corresponding question", "Please generate a question based on the answer", "Let's play Jeopardy."] and sample one of those. We already do something similar elsewhere, e.g. here
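To make the idea concrete, here is a minimal sketch of sampling one of those instructions and prepending it to the reversed pair. The function name `reverse_augment`, the constant `REVERSE_INSTRUCTIONS`, and the newline-joined prompt format are assumptions for illustration only; they are not taken from the alpaca loader.

```python
import random
from typing import Tuple

# Instruction strings quoted from the suggestion above.
REVERSE_INSTRUCTIONS = [
    "I give you an answer and you find the corresponding question",
    "Please generate a question based on the answer",
    "Let's play Jeopardy.",
]


def reverse_augment(question: str, answer: str, rng: random.Random) -> Tuple[str, str]:
    """Hypothetical helper: swap question and answer and prefix a sampled instruction.

    Returns (prompt, target), where the prompt is the instruction followed by the
    original answer, and the target is the original question.
    """
    instruction = rng.choice(REVERSE_INSTRUCTIONS)
    # Assumed format: instruction and answer separated by a newline.
    prompt = f"{instruction}\n{answer}"
    return prompt, question


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    prompt, target = reverse_augment("What is the capital of France?", "Paris", rng)
    print(prompt)
    print(target)
```

Passing the random generator in explicitly keeps the sampling reproducible in tests; the real loader may handle seeding differently.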

@ambujpawar Would you like to work on this too? If so, we have to figure out who takes this issue. Sorry guys, I somehow missed this issue.

But I am sure we'll find some other stuff where both of you can help, e.g. https://github.com/LAION-AI/Open-Assistant/issues/2827

CloseChoice avatar Apr 21 '23 22:04 CloseChoice

Ok, with the merge of #2870 some things changed here. Sorry for this, but things move quickly here. We need to implement this for the get_formatted method of the DatasetEntry class. So adding a test here that checks whether questions and answers are reversed would be a good start.
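A rough sketch of what such a test could look like. `ToyDatasetEntry` is a hypothetical stand-in written for this example, not the project's DatasetEntry, and the `reverse` parameter and list return value of its `get_formatted` are assumptions; the real method's signature after #2870 may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyDatasetEntry:
    """Hypothetical stand-in for illustration; NOT the project's DatasetEntry."""

    question: str
    answer: str

    def get_formatted(self, reverse: bool = False) -> List[str]:
        # Assumed behaviour: return [prompt, reply], swapped when reverse is True.
        if reverse:
            return [self.answer, self.question]
        return [self.question, self.answer]


def test_reverse_swaps_question_and_answer() -> None:
    entry = ToyDatasetEntry(question="What is 2 + 2?", answer="4")
    # Default order: question first, answer second.
    assert entry.get_formatted() == ["What is 2 + 2?", "4"]
    # Reversed order: the answer becomes the prompt, the question the reply.
    assert entry.get_formatted(reverse=True) == ["4", "What is 2 + 2?"]


if __name__ == "__main__":
    test_reverse_swaps_question_and_answer()
```

The same two assertions could be pointed at the real class once its actual interface is known.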

CloseChoice avatar Apr 24 '23 08:04 CloseChoice

I am closing this for now, since reverse augmentation was causing more harm than good.

andreaskoepf avatar May 06 '23 20:05 andreaskoepf