tokenizers

Support for `pad_encodings` in the Python API

Open LoicGrobol opened this issue 3 years ago • 7 comments

There are currently two ways of dynamically batching tokenized sentences with padding:

  1. Store them in List[str] form, which is not very satisfying because it requires encoding before batching (potential bottleneck and duplication of work)
  2. Store them pre-encoded as List[int] or some kind of tensor and pad manually downstream

I currently use the second option, but since the code is already there in tokenizers::utils::padding::pad_encodings, how do you feel about adding a Python binding for it?
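To make the two options concrete, here is a rough sketch of what I mean (toy data rather than my actual code, and pad id 0 is just for illustration):

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("xlm-roberta-base")
texts = ["a short sentence", "a somewhat longer example sentence"]

# Option 1: keep List[str] around and encode (with padding) at batching time
tokenizer.enable_padding(pad_id=0)
batch_ids = [e.ids for e in tokenizer.encode_batch(texts)]

# Option 2: encode once up front, store the raw ids, pad manually downstream
tokenizer.no_padding()
stored = [e.ids for e in tokenizer.encode_batch(texts)]
max_len = max(len(ids) for ids in stored)
batch_ids = [ids + [0] * (max_len - len(ids)) for ids in stored]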

LoicGrobol avatar Mar 30 '22 14:03 LoicGrobol

Hi @LoicGrobol

Store them in List[str] form, which is not very satisfying because it requires encoding before batching (potential bottleneck and duplication of work)

Do you have an example showing this is indeed a bottleneck? In order to do efficient (as in minimal) padding, you need to encode first, just to know the max_len of the batch. There's no way around this, I think. Filling the tensors should be extremely fast, even if it's a copy.
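To illustrate what I mean by minimal padding (a sketch only; tokenizer, texts and pad_id are assumed to be defined already):

import torch

# Encode first (with padding disabled) just to learn the batch's max_len...
encodings = tokenizer.encode_batch(texts)
max_len = max(len(e.ids) for e in encodings)

# ...then filling a pre-sized tensor is a cheap copy.
ids = torch.full((len(encodings), max_len), pad_id, dtype=torch.long)
for i, e in enumerate(encodings):
    ids[i, : len(e.ids)] = torch.tensor(e.ids, dtype=torch.long)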

Store them pre-encoded as List[int] or some kind of tensor and pad manually downstream

Pre-allocating is nice and is definitely useful sometimes, but it cannot be done without knowing the max_len of the batch either, unless you are allowed to truncate. But then, do you want to pre-allocate a seq=512 tensor because some strings can reach that length, even if your batch doesn't fill it? I have seen huge performance decreases with this approach. The pad tokens ARE processed by the model, and since attention is O(n²), "wasted" padded slots can cost quite a lot. So it's a very nice approach when it fits a use case, but it's not necessarily the best approach in general, no?
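As a rough back-of-the-envelope illustration of that cost (my own toy numbers, not a benchmark):

# If a batch's longest sequence is 40 tokens but everything is pre-allocated
# at 512, and self-attention scales as O(n^2) in sequence length, the
# attention work grows by roughly:
batch_max_len = 40
preallocated_len = 512
print((preallocated_len / batch_max_len) ** 2)  # ~164x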

Happy to hear more about what you had in mind and see if we can cook something nice. Could you share some dummy sample code showing what you would like to achieve, and how you are currently achieving it?

Narsil avatar Mar 30 '22 15:03 Narsil

Hi, @Narsil, so currently I do this in zeldarose:

  1. We load, tokenize (truncating) and encode text data on a single node and store it as a datasets object at the start of training.
  2. Then the training processes on every node simply access the dataset to get batches of samples, which are batched and padded at that point.

This is quite fast, because everything is already encoded when we get to step 2: we just have to manipulate tensors, which are easy to use in a distributed setting (List[str], not so much). Also, we run many epochs, and it would be a bit frustrating to have to re-encode the same samples several times.
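Roughly, the pipeline looks like this (a simplified sketch rather than the actual zeldarose code; the file name and hyperparameters are just placeholders):

import datasets
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# 1. On a single node: tokenize + encode once (with truncation) and save.
raw = datasets.load_dataset("text", data_files={"train": "corpus.txt"})["train"]
encoded = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)
encoded.save_to_disk("encoded_dataset")

# 2. In every training process: read the pre-encoded samples and batch+pad.
def collate(samples):
    return pad_sequence(
        [torch.tensor(s["input_ids"], dtype=torch.long) for s in samples],
        batch_first=True,
        padding_value=tokenizer.pad_token_id,
    )

loader = DataLoader(encoded, batch_size=8, collate_fn=collate, shuffle=True)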

Mostly at a higher level, I guess what I'd like is a pad method for fast tokenizers in transformers, but I guess it would have to start here, right? šŸ˜„

LoicGrobol avatar Apr 01 '22 08:04 LoicGrobol

This is quite fast, because everything is already encoded when we get to step 2: we just have to manipulate tensors, which are easy to use in a distributed setting (List[str], not so much). Also, we run many epochs, and it would be a bit frustrating to have to re-encode the same samples several times.

I think what you are doing currently is actually optimal and should be the recommended way to operate. Doing batch+padding in the distributed processes is nice since you get random dataset access (and hence random padded lengths). Padding+batching on the spot is, in my experience, super fast and works very well.

The only thing you could do to optimize further would be to save the batched samples in your dataset, but I think this is super detrimental to random batching, which is necessary for correct learning. It could be applied to the validation set, where randomness is not important (but I don't think the effort is worth it; at least it never was for me).

Adding pad_encodings to the bindings is doable, but your current code is even better, since you don't even have to tokenize anymore when doing the training: you're saving tokenization_time x (n_epochs - 1) (and more if you relaunch training, since the tokenization is kept throughout, no?).

If someone is space-constrained and cannot save/distribute the ids themselves, then the padding could actually be nice, but it's sort of already exposed through:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("xlm-roberta-base")
tokenizer.enable_padding(pad_id=0)
tokenizer.encode_batch(["This is ", "a test tha tis much longer"])[0].ids
# [0, 3293, 83, 2, 0, 0, 0, 0, 0]

So we're not returning the tensor, but the lists are already pre-padded.
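And if you do want a tensor, stacking the pre-padded lists is a one-liner (a sketch; all the encodings here come from the same padded batch, so they have the same length):

import torch

encodings = tokenizer.encode_batch(["This is ", "a test tha tis much longer"])
ids = torch.tensor([e.ids for e in encodings], dtype=torch.long)
mask = torch.tensor([e.attention_mask for e in encodings], dtype=torch.long)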

That being said, I really think your approach is better, tokenize early, and batch+pad at the latest possible time.

Narsil avatar Apr 04 '22 10:04 Narsil

Hi, I'm not sure I understand:

Adding pad_encodings to the bindings is doable, but your current code is even better, since you don't even have to tokenize anymore when doing the training: you're saving tokenization_time x (n_epochs - 1) (and more if you relaunch training, since the tokenization is kept throughout, no?).

What I am thinking of is this: if you have batch: List[tokenizers.Encoding], instead of having to pad with

import torch
from torch.nn.utils.rnn import pad_sequence

# padding_value has to be fetched from the tokenizer by hand
padded_batch = pad_sequence(
    [torch.tensor(sample.ids, dtype=torch.long) for sample in batch],
    batch_first=True,
    padding_value=padding_value,
)

having a tokenizer.pad_encodings that could be used as

padded_batch = tokenizer.pad_encodings(batch)

This way, things like padding_value or other tokenizer internals don't leak out: the tokenizer knows how it is supposed to pad encodings.
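For concreteness, roughly the kind of helper I have in mind, written as user code (the function is hypothetical, and the pad id has to be passed in explicitly, which is exactly the internal detail the tokenizer could hide):

from typing import List

import torch
from tokenizers import Encoding


def pad_encodings(batch: List[Encoding], pad_id: int) -> torch.Tensor:
    # Pad every encoding's ids to the longest sequence in the batch.
    max_len = max(len(e.ids) for e in batch)
    return torch.tensor(
        [e.ids + [pad_id] * (max_len - len(e.ids)) for e in batch],
        dtype=torch.long,
    )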

LoicGrobol avatar Apr 10 '22 08:04 LoicGrobol

Ok, I see what you mean, and indeed, if the tokenizer already knows about the padding value, it's definitely something to consider in terms of not leaking internal information, as you say.

Definitely worth exposing something like tokenizer.pad_encodings.

@SaulLu Is it something that might also help in transformers? Or are all paths using batch_encode?

Narsil avatar Apr 12 '22 07:04 Narsil

Thanks for pinging me on this issue! It is indeed a good approach!

In transformers, for the moment, whenever we use a method that encodes, we go through the encode_batch method. Nevertheless, I think it would be interesting to see whether, and how, we could reproduce the use case mentioned here with the transformers tokenizers.

SaulLu avatar Apr 12 '22 10:04 SaulLu

Oh, actually what I do here is via transformers tokenizers, but I figured that getting a Python binding for the Rust padding function would be a good first step :-)

LoicGrobol avatar Apr 12 '22 13:04 LoicGrobol

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Feb 20 '24 01:02 github-actions[bot]