Customise the separator used for splicing in DataCollatorWithFlattening
What does this PR do?
#31629 added `DataCollatorWithFlattening`, which packs the examples in a batch into one long sequence, splices the samples together with `-100` in the labels, and returns position ids for the attention computation.
Different models may use different token ids to splice samples together during training. For example, when doing post pre-training with a Qwen model, short samples can be packed into long ones to improve training speed and memory usage, separated by `<|endoftext|>`, whose token id is 151643. Allowing the user to customise the separator therefore makes this implementation more flexible, so the DataCollator can be used to build pre-training datasets for different models, as sketched below.
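Here is a minimal usage sketch, under the assumption that the new constructor argument is named `separator_id` as proposed in this PR; see the diff for the final name:

```python
from transformers import DataCollatorWithFlattening

# Splice packed samples with Qwen's <|endoftext|> (id 151643) instead of
# the default -100. `separator_id` is the argument name assumed from this PR.
collator = DataCollatorWithFlattening(separator_id=151643)

# Two short "samples" as plain token-id lists.
features = [
    {"input_ids": [101, 102, 103]},
    {"input_ids": [104, 105]},
]

batch = collator(features)
print(batch["input_ids"])     # tensor([[101, 102, 103, 104, 105]])
print(batch["position_ids"])  # tensor([[0, 1, 2, 0, 1]])
print(batch["labels"])        # tensor([[151643, 102, 103, 151643, 105]])
```

The separator id only appears in `labels`, marking the first token of each packed sample, while `position_ids` restart at 0 for every sample so attention can still tell the original examples apart.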
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [x] Did you read the contributor guideline, Pull Request section?
- [ ] Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- [x] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [ ] Did you write any new necessary tests?
Who can review?
Models:
- text models: @ArthurZucker
LGTM, indeed this makes sense. Can you just update the documentation of this data collator, please!
Updated! Feel free to edit if needed :) @ArthurZucker
Thanks 🤗