trl icon indicating copy to clipboard operation
trl copied to clipboard

Support packing for pretokenized datasets

Open kmehant opened this issue 7 months ago • 7 comments

At this point, trl returns the dataset as is if the provided dataset has signs of being tokenized already. https://github.com/huggingface/trl/blob/98ad01ddfd1e1b67ec018014b83cba40e0caea66/trl/trainer/sft_trainer.py#L503

Additionally, I see the ConstantLengthDataset https://github.com/huggingface/trl/blob/98ad01ddfd1e1b67ec018014b83cba40e0caea66/trl/trainer/utils.py#L426 has been written only in support of data that is not pretokenized and it should be possible to extend to pretokenized case as well.

Is there of any interest to support packing for pretokenized datasets? if so, I will be interested to contribute.

kmehant avatar Jul 17 '24 17:07 kmehant