trlx
Give a way to pass data that doesn't involve loading the whole prompt dataset into memory
🚀 The feature, motivation, and pitch
Currently trlx, as far as I know, requires that you pass a list of strings as the prompts. At larger scales, however, if the list of prompts is very large, this becomes a problem memory-wise.
Doing live preprocessing with dataloader workers and batch pre-fetching can be a reasonable trade-off compared to doing everything in advance and paying a long load time at the beginning, but the current approach of requiring everything up front rules that option out.
Again, as a library that is (or will be) widely used, it's hard to predict all the use cases, so being more flexible is likely better. For example, someone might want to implement a continual learning scheme where the dataset is modified live, which is not possible in the current setting. Many people I work with also do advanced preprocessing in the collate function, which is likewise not possible right now.
With the potential addition of extra information in https://github.com/CarperAI/trlx/issues/301, more complex dataloading would become possible too.
My suggestion is that the trainer also accept dataset or dataloader objects, in addition to the list of strings it currently takes.
Would be curious to hear you folks' thoughts on this.