
Add a `Sequence` to the processors

Open · SaulLu opened this issue 3 years ago · 9 comments

I think it could be useful to have a `Sequence` object for (post-)processors.

Indeed, I think a tokenizer might need to combine the `ByteLevel` post-processor and a `TemplateProcessing` processor.

Today `RobertaProcessing` is a processor that combines these 2 features in some way, but we can't use our own template.
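
For illustration, the combined setup might look like this in the Python bindings (a sketch only: `processors.Sequence` is the requested, not-yet-existing piece; the other classes exist today):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.processors import ByteLevel, Sequence, TemplateProcessing

tokenizer = Tokenizer(BPE())
# First fix the byte-level offsets, then apply a custom template:
tokenizer.post_processor = Sequence([
    ByteLevel(trim_offsets=True),
    TemplateProcessing(
        single="<s> $A </s>",
        pair="<s> $A </s> </s> $B </s>",
        special_tokens=[("<s>", 0), ("</s>", 2)],
    ),
])
```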

What do you think? :smile:

SaulLu avatar Jan 06 '22 16:01 SaulLu

In principle, very favorable.

In practice:

def process(encoding: Encoding, pair_encoding: Encoding, add_special_tokens: bool) -> Encoding

So with this signature, it's destructive, so it's not really obvious how we should "chain" them. If we do

for processor in processors:
    encoding = processor(encoding, pair_encoding, add_special_tokens)
    pair_encoding = None

Then it means only the first one will ever know that there was a pair, meaning the order of the sequence will be important in a non-obvious way IMO. (The alternative is that they ALL see the pair encoding, probably concatenating more than necessary.)
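
To make the problem concrete, here is a minimal Python sketch of that loop (the two toy processors are hypothetical, just to show that only the first one ever sees the pair):

```python
class AddSpecial:
    """Destructive: merges the pair into one encoding and adds special tokens."""
    def process(self, encoding, pair_encoding, add_special_tokens):
        merged = ["<s>"] + encoding
        if pair_encoding is not None:
            merged += ["</s>"] + pair_encoding
        return merged + ["</s>"]

class FixOffsets:
    """Runs second, so pair_encoding is always None by the time it is called."""
    def process(self, encoding, pair_encoding, add_special_tokens):
        return encoding  # would fix offsets here; the pair is already gone

encoding, pair_encoding = ["a", "b"], ["c"]
for processor in [AddSpecial(), FixOffsets()]:
    encoding = processor.process(encoding, pair_encoding, True)
    pair_encoding = None  # only the FIRST processor ever saw the pair

print(encoding)  # ['<s>', 'a', 'b', '</s>', 'c', '</s>']
```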

What do you think? Should we still do it?

Food for thought: there is a request to support more than a pair of encodings: https://github.com/huggingface/tokenizers/issues/804. Something to take into consideration if we envision making bigger changes to Processor to enable sequencing them.

Narsil avatar Jan 07 '22 08:01 Narsil

@Narsil is there a reason not to do

for processor in processors:
    encoding, pair_encoding = processor(encoding, pair_encoding, add_special_tokens)
# This last one is really just needed for backward compatibility; pair_encoding is None in this case.
encoding, pair_encoding = post_process.default_process(encoding, pair_encoding, add_special_tokens)

I feel you could actually remove <dyn PostProcessor>::default_process(encoding, pair_encoding, add_special_tokens) from most post-processors and add it at the end of the sequence. Typically `ByteLevel` would become `Sequence([ByteLevel, MergePair])`, but in order not to break backward compat we could use the solution above?
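
A rough Python sketch of that decomposition (`MergePair` is the hypothetical name from this comment; offset fixing is stubbed out):

```python
def fix_offsets(encoding):
    # Stand-in for ByteLevel's real offset trimming.
    return encoding

class ByteLevel:
    def process(self, encoding, pair_encoding, add_special_tokens):
        # Non-destructive: fix both encodings, pass the pair through.
        encoding = fix_offsets(encoding)
        if pair_encoding is not None:
            pair_encoding = fix_offsets(pair_encoding)
        return encoding, pair_encoding

class MergePair:
    def process(self, encoding, pair_encoding, add_special_tokens):
        # Destructive last step: what default_process does today.
        if pair_encoding is not None:
            encoding = encoding + pair_encoding
        return encoding, None

encoding, pair_encoding = ["a"], ["b"]
for processor in [ByteLevel(), MergePair()]:
    encoding, pair_encoding = processor.process(encoding, pair_encoding, True)
print(encoding)  # ['a', 'b']
```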

thomasw21 avatar May 27 '22 10:05 thomasw21

This is roughly what I suggested to @mishig25 as a solution.

Using Vec<Encoding> instead of pairs, actually, since a pair is also limiting in some form (#804), but roughly it's this.

I also suggested a way to implement it without breaking backward compatibility (like I wanted to do for the Decoder trait).

Narsil avatar May 27 '22 10:05 Narsil

Let me just re-share here a design idea that we had - I think - also discussed offline a long time ago: add a new method to all processors - for example named `process_chain` - with a new composable signature (taking a list), which would be the method called by `processors.Sequence`. :blush:
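
A minimal sketch of that idea, assuming the list-based signature discussed above (`process_chain` and these class bodies are hypothetical):

```python
class PostProcessor:
    def process_chain(self, encodings, add_special_tokens):
        # New composable endpoint: a list of encodings in, a list out.
        return encodings

class Sequence(PostProcessor):
    def __init__(self, processors):
        self.processors = processors

    def process_chain(self, encodings, add_special_tokens):
        # Every child sees the full list; nothing is merged implicitly.
        for p in self.processors:
            encodings = p.process_chain(encodings, add_special_tokens)
        return encodings
```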

SaulLu avatar May 27 '22 10:05 SaulLu

> Using Vec<Encoding> instead of pairs, actually, since a pair is also limiting in some form (https://github.com/huggingface/tokenizers/issues/804), but roughly it's this.

Maybe a HashMap also?

> Let me just re-share here a design idea that we had - I think - also discussed offline a long time ago: add a new method to all processors - for example named `process_chain` - with a new composable signature (taking a list), which would be the method called by `processors.Sequence`. 😊

I'm not sure how `process_chain` is different from the for loop? Do you mean that some processors can have different signatures, and users have to make sure that inputs/outputs are compatible sequentially?

thomasw21 avatar May 27 '22 11:05 thomasw21

It's more that I think it's safer to keep the methods that were exposed before, for backward-compatibility reasons (the v0.12.0 release of tokenizers had to be yanked for this reason, cf. #971), and add new ones for the new feature.

SaulLu avatar May 27 '22 12:05 SaulLu

> It's more that I think it's safer to keep the methods that were exposed before, for backward-compatibility reasons

I don't agree: it's a lot more work to maintain 2x the number of endpoints, and in this case I don't see how we're breaking backward compatibility, unless someone outside this repo is using `process`. IMO, if we are, we probably need to know where and how, to make the decision to expose a second endpoint.

thomasw21 avatar May 27 '22 17:05 thomasw21

@thomasw21, I was wondering: by changing the `process` signature with your method, the output format of a tokenizer that only has a `ByteLevel` processor will change, right?

Regarding the maintenance of 2 endpoints, we could reuse `process_chain` in `process`, no?
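
Something like this, presumably (a sketch, reusing the hypothetical `process_chain` from above; the final merge is stubbed out):

```python
class PostProcessor:
    def process_chain(self, encodings, add_special_tokens):
        return encodings

    def process(self, encoding, pair_encoding, add_special_tokens):
        # Old endpoint kept for compatibility, but implemented on top of
        # the new one, so there is only one real implementation to maintain.
        encodings = [encoding] if pair_encoding is None else [encoding, pair_encoding]
        encodings = self.process_chain(encodings, add_special_tokens)
        merged = []
        for e in encodings:
            merged.extend(e)  # stand-in for the default final merge
        return merged
```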

SaulLu avatar May 30 '22 07:05 SaulLu

Essentially this is what I believe @mishig25 has done, but I still think we shouldn't have such a design as it allows

Sequence([ByteLevel])

and

ByteLevel

to have different behaviours, since they call two different endpoints.
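
Concretely, the concern is that the two spellings take different code paths (a self-contained toy sketch; nothing about the real library's behaviour is implied):

```python
class ByteLevel:
    def process(self, encoding, pair_encoding, add_special_tokens):
        # Old endpoint: includes the default pair merge itself.
        return encoding + (pair_encoding or [])

    def process_chain(self, encodings, add_special_tokens):
        # New endpoint: no merge; a Sequence merges once at the end.
        return encodings

class Sequence:
    def __init__(self, processors):
        self.processors = processors

    def process(self, encoding, pair_encoding, add_special_tokens):
        encodings = [encoding] + ([pair_encoding] if pair_encoding else [])
        for p in self.processors:
            encodings = p.process_chain(encodings, add_special_tokens)
        merged = []
        for e in encodings:
            merged.extend(e)
        return merged

enc, pair = ["a"], ["b"]
print(ByteLevel().process(enc, pair, True))              # ['a', 'b']
print(Sequence([ByteLevel()]).process(enc, pair, True))  # ['a', 'b']
# They agree here only because both happen to merge the same way; nothing
# forces ByteLevel.process and ByteLevel.process_chain to stay consistent.
```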

thomasw21 avatar May 30 '22 08:05 thomasw21

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Mar 06 '24 01:03 github-actions[bot]