Open-Assistant icon indicating copy to clipboard operation
Open-Assistant copied to clipboard

convert scrolls dataset to multipart dialog

Open huu4ontocord opened this issue 2 years ago • 3 comments

The scrolls dataset is interesting because it is a long text dataset. the task is to experiment with breaking the information into a dialog paths, including the final task (qa, summarization, nli). So instruction->answer, instruction->answer, ... final instruction->answer. ideally we would like to prime the assitant to be able to reason over long dialog paths.

https://www.scrolls-benchmark.com/

huu4ontocord avatar Jan 09 '23 04:01 huu4ontocord

Is there anywhere I can look up how datasets are supposed to be structured for open-assistant?

I saw the deck discussing the conversation tree data structure, but I'm not sure how we actually want the datasets formatted / structured.

Do we have any examples I can use as a reference?

beegieb avatar Feb 07 '23 23:02 beegieb

I assigned to you. You can do it kinda like this:

User: I'm reading a [article|story] and need to summarize it. Can you help me with reading parts of it and then help with the summary?

Here is the introduction: {text} Assistant: I'm happy to help. This introduction is about {either get the summary from the answer or generate it using t5-large} User: I need to also understand this part: {next section} What do you think about ... {generated topic question using question generator-answer} Assistant: This section is above XYZ. In answer to your question about {topic}, {answer} ... User: Now summarize all the parts above into a coheren final summary. Assistant: The final summary is: {actual summary from the scrolls dataset}

You can do variations of these types of dialog paths in order to simulate long range tasks.

huu4ontocord avatar Feb 07 '23 23:02 huu4ontocord

These seems to be an issue of the time before OIG was created. Was this task completed as part of OIG?

andreaskoepf avatar May 05 '23 11:05 andreaskoepf