
Make `tokenize_and_concatenate` work with more datasets

Description

TL;DR: I tried to use `tokenize_and_concatenate` with a general HF dataset. It was the function I wanted for my use case [1], but I hit problems: two were blocked by an error message [2], and one passed through silently without my knowledge [3].
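For context, a minimal sketch of the kind of call that triggers the problems (the dataset name is just an illustrative example; any streaming text dataset hits the same code path, since `streaming=True` yields an `IterableDataset` rather than an Arrow-backed `Dataset`):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

from transformer_lens.utils import tokenize_and_concatenate

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Streaming gives an IterableDataset, not an Arrow-backed Dataset,
# which previously crashed inside tokenize_and_concatenate
dataset = load_dataset("stas/openwebtext-10k", split="train", streaming=True)

tokens_dataset = tokenize_and_concatenate(
    dataset, tokenizer, streaming=True, max_length=128
)
```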

This PR fixes these issues while keeping backward compatibility. The PR:

  • [1] Makes the function work with more than just Arrow datasets. Maybe this was universal in past versions of HF `datasets`, but now most datasets are not Arrow datasets.
  • [2a] Only passes `num_proc` when we aren't streaming, since passing it causes a bug when an `IterableDataset` is given to this function (see the sketch after this list).
  • [2b] Similarly, skips the final formatting step, as this also causes a bug with `IterableDataset`s.
  • [3] Makes removing padding optional. E.g. Pythia's training data contained pad tokens, and sometimes we want to keep them.
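For illustration, a minimal sketch of the [2a]/[2b] guards. This is not the exact diff; `tokenize_function` and `column_name` stand in for the function's internals:

```python
def tokenize_respecting_streaming(
    dataset, tokenize_function, column_name, streaming, num_proc
):
    # [2a] IterableDataset.map() does not accept num_proc, so only pass it
    # for regular (non-streaming) Arrow-backed datasets
    map_kwargs = {}
    if not streaming:
        map_kwargs["num_proc"] = num_proc

    tokenized = dataset.map(
        tokenize_function,
        batched=True,
        remove_columns=[column_name],
        **map_kwargs,
    )

    # [2b] IterableDataset has no set_format(), so skip the final
    # formatting step when streaming
    if not streaming:
        tokenized.set_format(type="torch", columns=["tokens"])
    return tokenized
```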

Type of change

Please delete options that are not relevant.

  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [x] New feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • [ ] This change requires a documentation update

Checklist:

  • [x] I have commented my code, particularly in hard-to-understand areas
  • [x] I have made corresponding changes to the documentation
  • [x] My changes generate no new warnings
  • [x] I have added tests that prove my fix is effective or that my feature works
  • [x] New and existing unit tests pass locally with my changes
  • [x] I have not rewritten tests relating to key interfaces which would affect backward compatibility

ArthurConmy · Dec 28 '23