Easy-Transformer
Make `tokenize_and_concatenate` work with more datasets
Description
TL;DR: I tried to use `tokenize_and_concatenate` with a general HF `Dataset`. `tokenize_and_concatenate` was the function I wanted for my use case [1], but I hit three problems: two were blocked by an error message [2], and one passed through silently without my knowledge [3].
This PR fixes these issues while keeping backward compatibility. The PR:
[1] Makes the function work with more than just Arrow datasets. This may have been the norm in past versions of HF `datasets`, but most datasets are now not Arrow datasets.
[2] a) Only passes `num_proc` if we aren't streaming -- passing it causes a bug when an `IterableDataset` is given to this function.
[2] b) Similarly, skips the final formatting step, as this also causes a bug with `IterableDataset`s.
[3] Makes padding removal optional. E.g. Pythia's training data contained pad tokens, and sometimes we want to keep them.
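To illustrate the shape of fix [2] a): `IterableDataset.map()` does not accept `num_proc`, so the argument has to be included only for regular (non-streaming) datasets. A minimal sketch of that pattern is below; the helper name `build_map_kwargs` is illustrative only and is not the actual code in this PR.

```python
def build_map_kwargs(streaming: bool, num_proc: int = 10) -> dict:
    """Build kwargs for Dataset.map(), conditionally including num_proc.

    IterableDataset.map() (used when streaming=True) raises an error if
    num_proc is passed, so we only include it for in-memory datasets.
    """
    kwargs = {"batched": True}
    if not streaming:
        kwargs["num_proc"] = num_proc
    return kwargs


# Non-streaming: num_proc is forwarded to Dataset.map()
print(build_map_kwargs(streaming=False))  # {'batched': True, 'num_proc': 10}

# Streaming: num_proc is omitted so IterableDataset.map() doesn't error
print(build_map_kwargs(streaming=True))  # {'batched': True}
```

The same guard applies to fix [2] b]: the final `set_format` call is skipped for streaming datasets, since `IterableDataset` does not support it.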
Type of change
Please delete options that are not relevant.
- [ ] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [ ] This change requires a documentation update
Checklist:
- [x] I have commented my code, particularly in hard-to-understand areas
- [x] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [x] I have added tests that prove my fix is effective or that my feature works
- [x] New and existing unit tests pass locally with my changes
- [x] I have not rewritten tests relating to key interfaces which would affect backward compatibility