
Enforce and test common interface for datasets in model_training

Status: Open. CloseChoice opened this issue 2 years ago • 2 comments

Currently each training dataset has its own class in which some transformation of the base dataset is done. Our models rely on specific data formats, but these formats are never enforced, so things are broken now or may break in the future. With proper design and testing we can prevent this from happening and find bugs in the dataset transformations that exist currently.

One possible solution would be to write a DatasetMixin that extracts a very small subset of the data (a few samples are enough) and checks whether the output format corresponds to the expected format. I identify the following todos:

  • [ ] find different required dataset formats for SFT, RM and RL
  • [ ] write mixin class
  • [ ] let current datasets inherit from Mixin class
  • [ ] write tests that perform the check which is implemented by the mixin
  • [ ] document the tested datasets

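The mixin idea could look roughly like the sketch below. All names here (`FormatCheckMixin`, `expected_types`, `check_format`, `ToyDataset`) are hypothetical placeholders, not existing Open-Assistant code; real dataset classes would override `expected_types` per task (SFT, RM, RL):

```python
# Hypothetical sketch of a format-checking mixin, assuming each dataset
# yields tuples of fields (e.g. (prompt, reply) string pairs for SFT).
class FormatCheckMixin:
    """Validates a few samples against an expected per-field type signature."""

    expected_types = (str, str)  # override per task (SFT, RM, RL)
    n_check = 5                  # how many samples to inspect

    def check_format(self):
        for i in range(min(self.n_check, len(self))):
            sample = self[i]
            assert len(sample) == len(self.expected_types), (
                f"sample {i} has {len(sample)} fields, "
                f"expected {len(self.expected_types)}"
            )
            for field, typ in zip(sample, self.expected_types):
                assert isinstance(field, typ), (
                    f"sample {i}: expected {typ.__name__}, "
                    f"got {type(field).__name__}"
                )


class ToyDataset(FormatCheckMixin):
    """Minimal stand-in for a real dataset class."""

    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]


ds = ToyDataset([("What is 2+2?", "4"), ("Name a color.", "Blue")])
ds.check_format()  # passes silently when the format is correct
```

A test could then simply instantiate each registered dataset class and call `check_format()` on it.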
CloseChoice avatar Apr 04 '23 06:04 CloseChoice

I would like to take this up :)

ShubhamKaushal15 avatar Apr 07 '23 16:04 ShubhamKaushal15

Perfect, if you need any guidance then ping me. I also have some more suggestions regarding this: there is a streaming mode for the dataset loading. I would suggest that in the tests we mock the loading method (with the `patch` function from pytest/unittest) so that we just stream the dataset (to avoid taking up so much local memory). Then we could check whether the first x elements have the expected types.

So I guess we don't need the mixin classes then. My first goal would be to get the patching functionality in place. But feel free to implement it differently; this is just a suggestion, and I'll be happy with anything that works.
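The patching approach described above could be sketched as follows. Note that `load_sft_dataset` and `fake_streaming_load` are hypothetical stand-ins; a real test would patch the actual loading function in model_training:

```python
import itertools
from unittest.mock import patch


def load_sft_dataset(name):
    """Hypothetical loader a real test would patch; assume it downloads data."""
    raise RuntimeError("would download the full dataset")


def fake_streaming_load(name):
    """Stand-in that yields samples lazily instead of downloading anything."""
    def stream():
        for i in itertools.count():
            yield (f"prompt {i}", f"reply {i}")
    return stream()


def check_first_n(samples, n=5):
    """Type-check only the first n elements of a (possibly infinite) stream."""
    for sample in itertools.islice(samples, n):
        assert isinstance(sample, tuple) and len(sample) == 2
        assert all(isinstance(field, str) for field in sample)


# Patch the loader by its import path so the test never hits the network.
with patch(f"{__name__}.load_sft_dataset", fake_streaming_load):
    ds = load_sft_dataset("some_dataset")
    check_first_n(ds, n=5)  # only five samples ever materialize
```

The key point is that `itertools.islice` pulls a fixed number of elements from the generator, so memory use stays constant regardless of dataset size.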

CloseChoice avatar Apr 07 '23 16:04 CloseChoice