Support packing for pretokenized datasets
Currently, trl returns the dataset unchanged if it shows signs of having been tokenized already: https://github.com/huggingface/trl/blob/98ad01ddfd1e1b67ec018014b83cba40e0caea66/trl/trainer/sft_trainer.py#L503
Additionally, I see that ConstantLengthDataset (https://github.com/huggingface/trl/blob/98ad01ddfd1e1b67ec018014b83cba40e0caea66/trl/trainer/utils.py#L426) was written only to support data that is not pretokenized. It should be possible to extend it to the pretokenized case as well.
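To illustrate the kind of extension I have in mind, here is a minimal sketch of packing already-tokenized examples into fixed-length sequences. It is not trl code: the function name `pack_pretokenized` and its parameters `seq_length` and `eos_token_id` are hypothetical, and it assumes each example is a dict with an `input_ids` list. It follows the general buffer-and-chunk idea of concatenating examples with a separator token and cutting the stream into equal-sized pieces.

```python
from typing import Dict, Iterable, Iterator, List


def pack_pretokenized(
    examples: Iterable[Dict[str, List[int]]],
    seq_length: int,
    eos_token_id: int,
) -> Iterator[Dict[str, List[int]]]:
    """Yield fixed-length chunks built from already-tokenized examples.

    Hypothetical helper for illustration only; not part of trl.
    """
    buffer: List[int] = []
    for example in examples:
        # Append the example's token ids, followed by EOS as a separator.
        buffer.extend(example["input_ids"])
        buffer.append(eos_token_id)
        # Emit every full chunk the buffer currently holds.
        while len(buffer) >= seq_length:
            chunk, buffer = buffer[:seq_length], buffer[seq_length:]
            yield {"input_ids": chunk, "labels": list(chunk)}
    # Any trailing partial chunk (fewer than seq_length tokens) is dropped.


data = [
    {"input_ids": [1, 2, 3]},
    {"input_ids": [4, 5]},
    {"input_ids": [6, 7, 8, 9]},
]
packed = list(pack_pretokenized(data, seq_length=4, eos_token_id=0))
```

With this toy input, every packed example has exactly `seq_length` tokens, and a single chunk can span the boundary between two original examples. Dropping the trailing partial chunk is a simplification; padding it instead would be another option.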
Is there any interest in supporting packing for pretokenized datasets? If so, I would be happy to contribute.