Open-Assistant Implement Dataset Entry for Datasets

We have a new DatasetEntry class that helps us generalize over datasets and enforce a common format. We need to implement this class for a few more datasets:

[ ] SODA (@hardikyagnik)
[ ] JokeExplanation (@sampatkalyan, needs approval from CodeOwner)
[ ] WebGPT (@CloseChoice, needs review)
[x] Alpaca (@CloseChoice)
[x] AlpacaGpt4 (@CloseChoice)

All of these datasets are found here. An example of how the new class is implemented is found here.

A PR can implement a SINGLE dataset. If you implement one, checking the dataset with our script would be nice (but optional):

python check_dataset_appearances.py -d <dataset-name> --cache_dir <cache-dir> --mode sft

If you take up one of the datasets, please mention the dataset name in your comment, and I'll mention you next to the corresponding dataset.

Apr 21 '23 22:04 CloseChoice

I would like to take up this task.

Apr 22 '23 07:04 hardikyagnik

@hardikyagnik nice, which dataset are you working on? I'm just asking so that others can work on the other datasets simultaneously.

Apr 22 '23 07:04 CloseChoice

I would like to take up this task.

@hardikyagnik, I assigned you. As @CloseChoice said, it would be great if you could specify which one you will work on. Potentially others could help (we need this as soon as possible).

Apr 22 '23 11:04 andreaskoepf

Hey @CloseChoice, @andreaskoepf, I will start picking up datasets sequentially starting with SODA.

Apr 23 '23 01:04 hardikyagnik

Hi, I would also like to work on this issue.

Apr 23 '23 13:04 sampatkalyan

@sampatkalyan, could you work on JokeExplanation?

Apr 23 '23 21:04 CloseChoice

@CloseChoice Yes, I will begin work on JokeExplanation.

Apr 24 '23 06:04 sampatkalyan

@sampatkalyan, @hardikyagnik, did you already look into the implementation? We really need this ASAP. If there is anything I can help you with, let me know; otherwise I'll take up the task if this is the last blocking thing for the next SFT run.

Apr 24 '23 19:04 CloseChoice

Hey @CloseChoice , I noticed that in the current SODA implementation, for each dialogue, the first question is updated by adding a narrative field as a suffix, and the entire dialogue is returned as a list of lists of strings. To convert this to a DatasetEntry, would it be okay to separate out the alternate entries in the dialogue into 'questions' and 'answers' fields as members of the DatasetEntry?

Apr 25 '23 04:04 hardikyagnik

@hardikyagnik, the structure of the DatasetEntry for SODA would be

DatasetEntry(
    questions=[PREFIX + Q1, Q2, Q3, ...],
    answers=[A1, A2, A3, ...]
)

Apr 25 '23 06:04 CloseChoice
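
The structure described above can be sketched roughly as follows. This is only an illustration: the `DatasetEntry` dataclass here is a hypothetical stand-in for the project's class, and `dialogue_to_entry`, its `prefix` parameter, and the rule for dropping an unanswered final turn are assumptions, not the actual SODA implementation.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """Hypothetical stand-in for the project's DatasetEntry class."""

    questions: list[str] = field(default_factory=list)
    answers: list[str] = field(default_factory=list)


def dialogue_to_entry(dialogue: list[str], prefix: str = "") -> DatasetEntry:
    """Split an alternating dialogue into questions and answers (assumed helper)."""
    # Even positions are treated as questions, odd positions as answers.
    questions = dialogue[0::2]
    answers = dialogue[1::2]
    # Assumption: a trailing question without an answer is dropped.
    questions = questions[: len(answers)]
    # Only the first question carries the prefix (e.g. the narrative).
    if questions:
        questions[0] = prefix + questions[0]
    return DatasetEntry(questions=questions, answers=answers)


entry = dialogue_to_entry(["Q1", "A1", "Q2", "A2", "Q3"], prefix="P: ")
print(entry.questions, entry.answers)
```

Slicing creates new lists, so the input dialogue is left unmodified.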

@CloseChoice for JokeExplanation the entry would be DatasetEntry(questions=joke, answers=explanation). Correct me if I am wrong.

Apr 25 '23 08:04 sampatkalyan

@sampatkalyan Yes, that is correct. Please verify that you match the types in the annotations.

Apr 25 '23 08:04 CloseChoice

@CloseChoice I am done with my implementation of JokeExplanation. Should I open the PR? Also, in the AlpacaGpt4 class the annotation for the dunder method __getitem__ looks wrong: I think it should be DatasetEntry instead of list[str] | tuple[str]. Please correct me if I am wrong.

Apr 25 '23 11:04 sampatkalyan
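
For illustration, a dataset whose `__getitem__` is annotated to return a DatasetEntry might look like the sketch below. The `DatasetEntry` dataclass, the `JokeExplanationSketch` class, and the assumption that both fields are lists of strings are hypothetical; the real annotations in the repository may differ and should be checked.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class DatasetEntry:
    """Hypothetical stand-in; field types are an assumption."""

    questions: list[str]
    answers: list[str]


class JokeExplanationSketch:
    """Toy dataset holding (joke, explanation) pairs in memory."""

    def __init__(self, pairs: list[tuple[str, str]]) -> None:
        self.pairs = pairs

    def __len__(self) -> int:
        return len(self.pairs)

    # The return annotation names the entry type rather than list[str] | tuple[str].
    def __getitem__(self, index: int) -> DatasetEntry:
        joke, explanation = self.pairs[index]
        # Wrap the single strings in lists to match the assumed field types.
        return DatasetEntry(questions=[joke], answers=[explanation])


dataset = JokeExplanationSketch([("a joke", "its explanation")])
print(dataset[0])
```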

@CloseChoice In JokeExplanation there is a question variable and an answer variable in the init method that are not used or updated anywhere in the code. Should I remove them, or am I missing something here?

Apr 25 '23 11:04 sampatkalyan

@sampatkalyan Please create the PR, I'll have a look.

Apr 25 '23 11:04 CloseChoice

For AlpacaGpt4, should I make the change in a separate PR or in the same PR?

Apr 25 '23 11:04 sampatkalyan

If it's just a type annotation, put it in the same PR

Apr 25 '23 11:04 CloseChoice

@CloseChoice I made the PR. Please check it and let me know if there are any changes to be made.

Apr 25 '23 12:04 sampatkalyan

@CloseChoice I added the suggested changes. Can you please check? Thank you.

Apr 25 '23 14:04 sampatkalyan

Hello @CloseChoice, I've created the PR, let me know if it requires any updates.

Apr 26 '23 04:04 hardikyagnik
