DatasetEntry Roadmap

Open CloseChoice opened this issue 2 years ago • 0 comments

This issue proposes a major refactoring of the data preprocessing in the model_training logic (trainer_sft, trainer_rm and trainer_rl). The crucial point is step 4, but the earlier steps are needed as preparation.

  • [ ] 1. make RM usable with DatasetEntry (and make sure that the oasst dataset still works)
  • [ ] 2. refactor DatasetEntry into separate classes (DatasetEntryRM, DatasetEntryRL, DatasetEntrySFT)
  • [ ] 3. generalize preprocessing and filtering (e.g. removing "as an AI language model") and apply it to all datasets (accepting the performance hit for now)
  • [ ] 4. split dataset preprocessing/tokenization and the training run into separate steps
  • [ ] 5. add property tests for the preprocessed & tokenized data
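
To make steps 2 and 3 more concrete, here is a minimal sketch of what per-trainer entry classes and a shared phrase filter could look like. Only the class names and the "as an AI language model" phrase come from this issue; the fields (`conversation`, `prompt`, `ranked_replies`) and the helper functions are hypothetical and do not reflect the current model_training code.

```python
from dataclasses import dataclass

# Phrases that mark a reply as unwanted. Only this one phrase is named in
# the issue; a real list would presumably be longer.
FILTER_PHRASES = ("as an ai language model",)


@dataclass
class DatasetEntrySFT:
    # Hypothetical field: alternating prompter/assistant turns.
    conversation: list[str]


@dataclass
class DatasetEntryRM:
    # Hypothetical fields: a prompt and its replies, ordered best to worst.
    prompt: str
    ranked_replies: list[str]


@dataclass
class DatasetEntryRL:
    # Hypothetical field: RL rollouts only need the prompt.
    prompt: str


def contains_filtered_phrase(text: str, phrases=FILTER_PHRASES) -> bool:
    """Case-insensitive substring check against the filter phrases."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in phrases)


def filter_sft(entries: list[DatasetEntrySFT]) -> list[DatasetEntrySFT]:
    """Drop every conversation in which any turn contains a filtered phrase."""
    return [
        entry
        for entry in entries
        if not any(contains_filtered_phrase(turn) for turn in entry.conversation)
    ]


def filter_rm(entries: list[DatasetEntryRM]) -> list[DatasetEntryRM]:
    """Remove flagged replies; keep an entry only if a ranking is still possible."""
    kept_entries = []
    for entry in entries:
        kept = [r for r in entry.ranked_replies if not contains_filtered_phrase(r)]
        if len(kept) >= 2:  # a ranking needs at least two replies
            kept_entries.append(DatasetEntryRM(entry.prompt, kept))
    return kept_entries
```

Keeping the phrase check in one function means each entry type only decides how to apply it (drop the whole conversation for SFT, drop single replies for RM), which is one possible way to share filtering across all datasets.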

CloseChoice, May 01 '23 12:05