Implement Dataset Entry for Datasets

Open CloseChoice opened this issue 1 year ago • 20 comments

We have a new DatasetEntry class that helps us generalize over datasets and enforce a common format. We need to implement this class for a few more datasets:

  • [ ] SODA (@hardikyagnik)
  • [ ] JokeExplanation (@sampatkalyan, needs approval from CodeOwner)
  • [ ] WebGPT (@CloseChoice, needs review)
  • [x] Alpaca (@CloseChoice)
  • [x] AlpacaGpt4 (@CloseChoice)

All of these datasets are found here. An example of how the new class is implemented is found here.

PRs can be implementations for a SINGLE Dataset. If you implement this, then checking the dataset with our script would be nice (but optional):

python check_dataset_appearances.py -d <dataset-name> --cache_dir <cache-dir> --mode sft

If you take up one of the datasets, please mention the dataset name in your comment, and I'll add your name next to the corresponding dataset.

CloseChoice avatar Apr 21 '23 22:04 CloseChoice

I would like to take up this task.

hardikyagnik avatar Apr 22 '23 07:04 hardikyagnik

@hardikyagnik nice, which dataset are you working on? I'm just asking so that others can work on the other datasets simultaneously.

CloseChoice avatar Apr 22 '23 07:04 CloseChoice

I would like to take up this task.

@hardikyagnik I assigned you. As @CloseChoice said, it would be great if you could specify which one you will work on, so that others could potentially help (we need this as soon as possible).

andreaskoepf avatar Apr 22 '23 11:04 andreaskoepf

Hey @CloseChoice, @andreaskoepf, I will start picking up datasets sequentially starting with SODA.

hardikyagnik avatar Apr 23 '23 01:04 hardikyagnik

Hi, I would also like to work on this issue.

sampatkalyan avatar Apr 23 '23 13:04 sampatkalyan

@sampatkalyan, could you work on JokeExplanation?

CloseChoice avatar Apr 23 '23 21:04 CloseChoice

@CloseChoice Yes I will begin the work on the JokeExplanation.

sampatkalyan avatar Apr 24 '23 06:04 sampatkalyan

@sampatkalyan, @hardikyagnik did you already look into the implementation? We really need this ASAP. If there is anything I can help you with, let me know; otherwise I'll take up the task myself if this is the last thing blocking the next SFT run.

CloseChoice avatar Apr 24 '23 19:04 CloseChoice

Hey @CloseChoice , I noticed that in the current SODA implementation, for each dialogue, the first question is updated by adding a narrative field as a suffix, and the entire dialogue is returned as a list of lists of strings. To convert this to a DatasetEntry, would it be okay to separate out the alternate entries in the dialogue into 'questions' and 'answers' fields as members of the DatasetEntry?

hardikyagnik avatar Apr 25 '23 04:04 hardikyagnik

@hardikyagnik, the structure for dataset entry for SODA would be

DatasetEntry(
    questions=[PREFIX + Q1, Q2, Q3, ...],
    answers=[A1, A2, A3, ...],
)
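
The structure above could be produced from an alternating dialogue list roughly as follows. This is only a minimal sketch: the DatasetEntry dataclass here is a hypothetical stand-in for the project's class, the helper name dialogue_to_entry is made up for illustration, and dropping a trailing unanswered question is an assumption rather than the project's actual behavior.

```python
from dataclasses import dataclass


@dataclass
class DatasetEntry:
    """Hypothetical stand-in for the project's DatasetEntry class."""

    questions: list[str]
    answers: list[str]


def dialogue_to_entry(dialogue: list[str], prefix: str = "") -> DatasetEntry:
    """Split an alternating dialogue into questions and answers (illustrative helper)."""
    # Even positions are treated as questions, odd positions as answers.
    questions = list(dialogue[0::2])
    answers = list(dialogue[1::2])
    # Assumption: a trailing question without an answer is dropped.
    questions = questions[: len(answers)]
    # The narrative prefix is attached to the first question only.
    if questions:
        questions[0] = prefix + questions[0]
    return DatasetEntry(questions=questions, answers=answers)


entry = dialogue_to_entry(["Q1", "A1", "Q2", "A2", "Q3"], prefix="PREFIX ")
print(entry)
```

Keeping questions and answers as two parallel lists of equal length means that turn i of the dialogue is always the pair (questions[i], answers[i]).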

CloseChoice avatar Apr 25 '23 06:04 CloseChoice

@CloseChoice for JokeExplanation, the entry would be DatasetEntry(questions=joke, answers=explanation). Correct me if I am wrong.

sampatkalyan avatar Apr 25 '23 08:04 sampatkalyan

@sampatkalyan yes, that is correct. Please verify that you match the types in the annotations.

CloseChoice avatar Apr 25 '23 08:04 CloseChoice

@CloseChoice I am done with my implementation of JokeExplanation. Should I open the PR? Also, in the AlpacaGpt4 class there is a wrong annotation on the dunder method __getitem__: I think it should return DatasetEntry instead of list[str] | tuple[str]. Please correct me if I am wrong.
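
To illustrate the annotation I mean, here is a minimal sketch. ToyJokeDataset and the DatasetEntry dataclass below are hypothetical stand-ins, not the real classes from the repository, and wrapping each string in a one-element list is an assumption about the expected field types.

```python
from dataclasses import dataclass


@dataclass
class DatasetEntry:
    """Hypothetical stand-in for the project's DatasetEntry class."""

    questions: list[str]
    answers: list[str]


class ToyJokeDataset:
    """Illustrative dataset; not the real JokeExplanation class."""

    def __init__(self, pairs: list[tuple[str, str]]) -> None:
        # Each pair is (joke, explanation).
        self.pairs = pairs

    def __len__(self) -> int:
        return len(self.pairs)

    def __getitem__(self, index: int) -> DatasetEntry:
        # The return annotation matches what is actually returned.
        joke, explanation = self.pairs[index]
        # Assumption: single-turn data is wrapped in one-element lists.
        return DatasetEntry(questions=[joke], answers=[explanation])


toy = ToyJokeDataset([("a joke", "its explanation")])
item = toy[0]
print(item)
```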

sampatkalyan avatar Apr 25 '23 11:04 sampatkalyan

@CloseChoice In JokeExplanation there are question and answer variables in the __init__ method that are not used or updated anywhere in the code. Should I remove them, or am I missing something here?

sampatkalyan avatar Apr 25 '23 11:04 sampatkalyan

@sampatkalyan Please create the PR, I'll have a look.

CloseChoice avatar Apr 25 '23 11:04 CloseChoice

For AlpacaGpt4, should I make the change in a separate PR or in the same PR?

sampatkalyan avatar Apr 25 '23 11:04 sampatkalyan

If it's just a type annotation, put it in the same PR.

CloseChoice avatar Apr 25 '23 11:04 CloseChoice

@CloseChoice I made the PR. Please check it and let me know if any changes are needed.

sampatkalyan avatar Apr 25 '23 12:04 sampatkalyan

@CloseChoice I added the suggested changes. Can you please check? Thank you.

sampatkalyan avatar Apr 25 '23 14:04 sampatkalyan

Hello @CloseChoice, I've created the PR, let me know if it requires any updates.

hardikyagnik avatar Apr 26 '23 04:04 hardikyagnik