Open-Assistant icon indicating copy to clipboard operation
Open-Assistant copied to clipboard

Feature/dataset entry

Open CloseChoice opened this issue 1 year ago • 0 comments

closes #2708

Add pydantic basemodel class (equivalent to dataclass but with stronger guarantees) to return from dolly dataset. Add the formatting functionality in the dataset entry class. This PR does quite a bit:

  • add pydantic dependency
  • introduce a new DatasetEntry class, which provides a method to do the formatting based on the mode and the QA_SPECIAL_TOKENS and the eos_token. This class should work as a general pattern to store and format single dataset entries. This class should also remove all the formatting errors we had previously with the datasets.
  • add tests for DialogueDataCollator. I trained a minimal tokenizer only on the tokens that are present in the tests to not bloat the code (still a lot of LOC).
  • added tests for the interplay of DialogueDataCollator and the newly introduced DatasetEntry class.
  • add small fixes to be backwards compatible in handling the new DatasetEntry and old rows from the dataset.

todos:

  • [x] mask system
  • [x] remove changes in config
  • [x] write tests for formatting of dataset entry

CloseChoice avatar Apr 14 '23 18:04 CloseChoice