Open-Assistant
Implement Dataset Entry for Datasets
We have a new DatasetEntry class which helps us generalize over datasets and enforce a common format.
We need to implement this class for a few more datasets:
- [ ] SODA (@hardikyagnik)
- [ ] JokeExplanation (@sampatkalyan, needs approval from CodeOwner)
- [ ] WebGPT (@CloseChoice, needs review)
- [x] Alpaca (@CloseChoice)
- [x] AlpacaGpt4 (@CloseChoice)
All of these datasets are found here. An example of how the new class is implemented is found here.
PRs can be implementations for a SINGLE Dataset. If you implement this, then checking the dataset with our script would be nice (but optional):
python check_dataset_appearances.py -d <dataset-name> --cache_dir <cache-dir> --mode sft
If you take up one of the datasets, please mention the dataset name in your comment, and I'll tag you next to the corresponding dataset.
I would like to take up this task.
@hardikyagnik nice, which dataset are you working on? I just ask, so that others can work on the other datasets simultaneously
I would like to take up this task.
@hardikyagnik I assigned you. As @CloseChoice said, it would be great if you could specify which dataset you will work on, so that others can potentially help (we need this as soon as possible).
Hey @CloseChoice, @andreaskoepf, I will start picking up datasets sequentially, starting with SODA.
Hi, I would also like to work on this issue.
@sampatkalyan, could you work on JokeExplanation?
@CloseChoice Yes I will begin the work on the JokeExplanation.
@sampatkalyan, @hardikyagnik did you already look into the implementation? We really need this asap. If there is anything I can help you with, let me know; otherwise I'll take up the task if this is the last blocking thing for the next sft run.
Hey @CloseChoice , I noticed that in the current SODA implementation, for each dialogue, the first question is updated by adding a narrative field as a suffix, and the entire dialogue is returned as a list of lists of strings. To convert this to a DatasetEntry, would it be okay to separate out the alternate entries in the dialogue into 'questions' and 'answers' fields as members of the DatasetEntry?
@hardikyagnik, the structure of the dataset entry for SODA would be
DatasetEntry(
    questions=[PREFIX + Q1, Q2, Q3, ...],
    answers=[A1, A2, A3, ...]
)
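The structure above can be sketched roughly as follows. This is only an illustration: the `DatasetEntry` dataclass here is a hypothetical stand-in rather than the actual Open-Assistant class, `dialogue_to_entry` is a made-up helper name, and dropping an unanswered trailing question is an assumption about how unpaired turns should be handled.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """Hypothetical stand-in, NOT the real Open-Assistant DatasetEntry."""

    questions: list[str] = field(default_factory=list)
    answers: list[str] = field(default_factory=list)


def dialogue_to_entry(dialogue: list[str], prefix: str = "") -> DatasetEntry:
    """Split an alternating [Q1, A1, Q2, A2, ...] dialogue into questions and answers."""
    # Assumption: a trailing question without an answer is dropped,
    # so that both lists always have the same length.
    usable = dialogue[: len(dialogue) - len(dialogue) % 2]
    questions = usable[0::2]  # even positions: Q1, Q2, ...
    answers = usable[1::2]  # odd positions: A1, A2, ...
    if questions:
        # Only the first question carries the prefix (e.g. the narrative).
        questions[0] = prefix + questions[0]
    return DatasetEntry(questions=questions, answers=answers)


entry = dialogue_to_entry(["Q1", "A1", "Q2", "A2", "Q3"], prefix="PREFIX ")
print(entry.questions)
print(entry.answers)
```

Keeping the two lists index-aligned means the i-th answer always belongs to the i-th question, which is what the structure above implies.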
@CloseChoice for JokeExplanation the entry would be DatasetEntry(questions=joke, answers=explanation). Correct me if I am wrong.
@sampatkalyan yes, that is correct, please verify that you match the types in the annotations
@CloseChoice I am done with my implementation of JokeExplanation. Should I open the PR? Also, in the AlpacaGpt4 class there is a wrong annotation on the dunder method __getitem__; I think it should be DatasetEntry instead of list[str] | tuple[str]. Please correct me if I am wrong.
@CloseChoice In JokeExplanation there are question and answer variables in the init method that are not used or updated anywhere in the code. Should I remove them, or am I missing something here?
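To illustrate the annotation point, here is a minimal sketch of a dataset whose `__getitem__` is annotated to return a DatasetEntry. Everything in it is assumed for illustration: the `DatasetEntry` dataclass is a stand-in for the real class, `JokeExplanationSketch` is a hypothetical name, and wrapping the joke and explanation in single-element lists is an assumption based on the list-valued fields discussed for SODA.

```python
from dataclasses import dataclass


@dataclass
class DatasetEntry:
    """Hypothetical stand-in, NOT the real Open-Assistant DatasetEntry."""

    questions: list[str]
    answers: list[str]


class JokeExplanationSketch:
    """Hypothetical dataset holding (joke, explanation) pairs in memory."""

    def __init__(self, pairs: list[tuple[str, str]]) -> None:
        self.pairs = pairs

    def __len__(self) -> int:
        return len(self.pairs)

    # The return annotation names DatasetEntry, not list[str] | tuple[str].
    def __getitem__(self, index: int) -> DatasetEntry:
        joke, explanation = self.pairs[index]
        # Assumption: both fields are lists, so single items are wrapped.
        return DatasetEntry(questions=[joke], answers=[explanation])


dataset = JokeExplanationSketch([("some joke", "why it is funny")])
print(dataset[0])
```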
@sampatkalyan Please create the PR, I'll have a look.
For AlpacaGpt4, should I make the change in a separate PR or in the same PR?
If it's just a type annotation, put it in the same PR
@CloseChoice I made the PR. Please check it and let me know if there are any changes to be made.
@CloseChoice I added the suggested changes. Can you please check? Thank you.
Hello @CloseChoice, I've created the PR, let me know if it requires any updates.