Open-Assistant
DatasetEntry Roadmap
This is a proposal for a major refactoring of the data preprocessing in the model_training trainer_sft, trainer_rm and trainer_rl logic. The crucial step is 4, but steps 1 to 3 are needed as preparation.
- [ ] 1. make RM usable with DatasetEntry (also make sure that the oasst dataset works)
- [ ] 2. refactor DatasetEntry to use different classes (DatasetEntryRM, DatasetEntryRL, DatasetEntrySFT)
- [ ] 3. generalize preprocessing and filtering (removing "as an AI language model", etc.) and apply this to all datasets (take the performance hit for now)
- [ ] 4. split dataset preprocessing/tokenization and the training run into separate steps
- [ ] 5. add testing properties for the preprocessed & tokenized data
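To make points 2 and 3 more concrete, here is a rough sketch of how per-task entry classes and a shared phrase filter could fit together. This is not the existing DatasetEntry implementation: the field names (`prompt`, `answer`, `ranked_replies`), the `assistant_texts` method, and the helpers `is_clean` and `filter_entries` are all hypothetical, and the only filter phrase used is the one named above.

```python
import re
from dataclasses import dataclass
from typing import Iterable, List

# Only the phrase mentioned in the roadmap; a real list would be longer.
UNWANTED = re.compile(r"as an ai language model", re.IGNORECASE)


@dataclass
class DatasetEntry:
    """Hypothetical base class; fields are assumptions, not the current API."""
    prompt: str

    def assistant_texts(self) -> List[str]:
        # Subclasses return every assistant-side text that should be filtered.
        raise NotImplementedError


@dataclass
class DatasetEntrySFT(DatasetEntry):
    answer: str

    def assistant_texts(self) -> List[str]:
        return [self.answer]


@dataclass
class DatasetEntryRM(DatasetEntry):
    ranked_replies: List[str]  # assumed to be ordered best first

    def assistant_texts(self) -> List[str]:
        return list(self.ranked_replies)


@dataclass
class DatasetEntryRL(DatasetEntry):
    # RL entries carry only a prompt in this sketch.
    def assistant_texts(self) -> List[str]:
        return []


def is_clean(entry: DatasetEntry) -> bool:
    """True if no assistant text contains the unwanted phrase."""
    return not any(UNWANTED.search(text) for text in entry.assistant_texts())


def filter_entries(entries: Iterable[DatasetEntry]) -> List[DatasetEntry]:
    """Apply the same filter to entries of any task type."""
    return [entry for entry in entries if is_clean(entry)]
```

Because the filter only depends on `assistant_texts`, the same function could be applied to all datasets regardless of whether they feed SFT, RM or RL training. Whether an RM entry should be dropped entirely or only lose the offending reply is an open design question; the sketch drops the whole entry.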