Open-Assistant
Open-Assistant copied to clipboard
Feature/dataset entry
closes #2708
Add pydantic basemodel class (equivalent to dataclass but with stronger guarantees) to return from dolly dataset. Add the formatting functionality in the dataset entry class. This PR does quite a bit:
- add pydantic dependency
- introduce a new DatasetEntry class, which provides a method to do the formatting based on the mode and the
QA_SPECIAL_TOKENS
and theeos_token
. This class should work as a general pattern to store and format single dataset entries. This class should also remove all the formatting errors we had previously with the datasets. - add tests for
DialogueDataCollator
. I trained a minimal tokenizer only on the tokens that are present in the tests to not bloat the code (still a lot of LOC). - added tests for the interplay of
DialogueDataCollator
and the newly introducedDatasetEntry
class. - add small fixes to be backwards compatible in handling the new
DatasetEntry
and old rows from the dataset.
todos:
- [x] mask system
- [x] remove changes in config
- [x] write tests for formatting of dataset entry