Open-Assistant DatasetEntry Roadmap

Open CloseChoice opened this issue 2 years ago • 0 comments

This is a proposal for a major refactoring of the data preprocessing in the model_training trainer_sft, trainer_rm, and trainer_rl logic. The crucial step is 4, but steps 1 to 3 are needed as preparation.

[ ] 1. make RM usable with DatasetEntry (also make sure that oasst dataset works)
[ ] 2. refactor DatasetEntry to use different classes (DatasetEntryRM, DatasetEntryRL, DatasetEntrySFT)
[ ] 3. generalize preprocessing and filtering (removing "as an AI language model", etc.) and apply this to all datasets (take the performance hit for now)
[ ] 4. split dataset preprocessing/tokenization and the training run into separate steps
[ ] 5. add testing properties for the preprocessed & tokenized data
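
To make steps 2 to 4 more concrete, here is a minimal sketch of how they might fit together. Only the class names DatasetEntrySFT and DatasetEntryRM and the phrase "as an AI language model" come from this issue; the field layouts, the helper functions (`filter_sft`, `filter_rm`, `preprocess_sft`, `load_preprocessed`), the JSON Lines output format, and the `tokenize` callable are all assumptions for illustration, not the existing model_training code.

```python
from __future__ import annotations

import json
import re
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Callable


# Step 2: one class per training mode. Field names are hypothetical.
@dataclass
class DatasetEntrySFT:
    questions: list[str]
    answers: list[str]


@dataclass
class DatasetEntryRM:
    prompt: str
    replies: list[str]  # assumed to be ordered best to worst


# Step 3: a single shared filter, applied to every dataset.
UNWANTED = re.compile(r"as an AI language model", re.IGNORECASE)


def is_clean(text: str) -> bool:
    """Return True if the text does not contain the unwanted phrase."""
    return UNWANTED.search(text) is None


def filter_sft(entries: list[DatasetEntrySFT]) -> list[DatasetEntrySFT]:
    """Keep only SFT entries whose answers are all clean."""
    return [e for e in entries if all(is_clean(a) for a in e.answers)]


def filter_rm(entries: list[DatasetEntryRM]) -> list[DatasetEntryRM]:
    """Drop unwanted replies; keep entries that still have >= 2 replies.

    The two-reply minimum is an assumption: a ranking needs a pair.
    """
    kept = []
    for e in entries:
        replies = [r for r in e.replies if is_clean(r)]
        if len(replies) >= 2:
            kept.append(DatasetEntryRM(prompt=e.prompt, replies=replies))
    return kept


# Step 4: preprocessing writes to disk; the training run only reads.
def preprocess_sft(
    entries: list[DatasetEntrySFT],
    tokenize: Callable[[str], list],
    out_path: Path,
) -> int:
    """Filter, tokenize, and write one JSON record per line."""
    kept = filter_sft(entries)
    with out_path.open("w", encoding="utf-8") as f:
        for e in kept:
            record = asdict(e)
            record["answer_tokens"] = [tokenize(a) for a in e.answers]
            f.write(json.dumps(record) + "\n")
    return len(kept)


def load_preprocessed(path: Path) -> list[dict]:
    """Read the records back, e.g. at the start of a training run."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f]
```

With this split, the checks from step 5 could run against the output of `load_preprocessed` (for example, asserting that no stored answer contains the filtered phrase) without starting a training run.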

May 01 '23 12:05 CloseChoice
