Token batching
Hello,
Many frameworks support token batching, in which batches are constructed not so that they contain the same number of sequences, but rather so that they contain approximately the same number of tokens (so a batch could consist either of a large number of short sequences or a small number of long sequences). One motivation for this is so that memory use is roughly constant from batch to batch, which makes it easier to use a very large batch size without risking an out-of-memory error.
For example, this is the behavior when using --max-tokens instead of --batch-size in fairseq.
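To make it concrete, here is a rough sketch of the kind of batching I mean. This is my own toy code, not anything that exists in transformers or fairseq, and the names (`TokenBatchSampler`, `lengths`, `max_tokens`) are made up:

```python
from torch.utils.data import Sampler


class TokenBatchSampler(Sampler):
    """Sketch: pack examples into batches whose padded size stays under a token budget."""

    def __init__(self, lengths, max_tokens):
        self.lengths = lengths        # token length of each dataset example
        self.max_tokens = max_tokens  # token budget per batch

    def __iter__(self):
        # Sort by length so each batch contains similarly sized sequences.
        order = sorted(range(len(self.lengths)), key=lambda i: self.lengths[i])
        batch, longest = [], 0
        for idx in order:
            new_longest = max(longest, self.lengths[idx])
            # Cost of the padded batch is (number of sequences) * (longest sequence).
            if batch and (len(batch) + 1) * new_longest > self.max_tokens:
                yield batch
                batch = []
                new_longest = self.lengths[idx]
            batch.append(idx)
            longest = new_longest
        if batch:
            yield batch
```

An instance of something like this could then be passed as the `batch_sampler` of a PyTorch DataLoader, so each yielded batch has roughly the same padded token count.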
I found a previous issue (https://github.com/huggingface/transformers/issues/14767) where this was asked. At the time, someone claimed the feature existed and posted a video, but the examples shown in that video do not actually implement it. Subsequent comments pointing out that the issue remained unresolved went unanswered.
So my question is, does token batching already exist in transformers? If so, how can I make use of it?
Thank you for your help! I wasn't sure whether to file this as a feature request, because it's not clear to me whether the feature has already been implemented or not.
Hi there, what you are asking for is not supported. Note that Transformers is primarily a library of models. You can adapt the data preprocessing part of any of our existing examples to suit your needs, but we won't support every feature out of the box as it's not the goal of the library.
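For instance, if you write your own batch sampler, you can plug it into the PyTorch DataLoader yourself. Something along these lines (just a sketch; `token_batch_sampler` is a placeholder for whatever grouping logic you implement):

```python
from torch.utils.data import DataLoader

# Sketch only: swap the batch_size-based DataLoader of an example script for
# one driven by a custom batch sampler. `token_batch_sampler` is a placeholder
# for your own grouping logic; train_dataset and data_collator come from the
# example script you adapt.
train_dataloader = DataLoader(
    train_dataset,
    batch_sampler=token_batch_sampler,  # yields lists of example indices
    collate_fn=data_collator,           # pads each yielded batch dynamically
)
```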
Hello,
Thank you for your quick reply. I'll admit I'm a bit surprised that this is considered out of scope. It is a models library, yes, but the main ways people interact with models are training (including fine-tuning) and inference, and in either case inputs need to be batched. Token batching is a very mainstream way to do that, especially for self-attention-based models, because very large batches are so common (at least when training from scratch; I'm fairly new to fine-tuning, so perhaps the situation is different there).
Thank you again for your help.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Token batching is a necessary feature for tasks like machine translation, where it is the standard setup in the field. If you want your experimental setup to be comparable with other frameworks, you need to be able to reproduce it.