tokenizers icon indicating copy to clipboard operation
tokenizers copied to clipboard

Add `postprocessor::Sequence`

Open mishig25 opened this issue 2 years ago • 5 comments

Closes https://github.com/huggingface/tokenizers/issues/873 by implementing preprocessor::Sequence

as suggested by @SaulLu,

Let me just re-share here a design idea that we had - I think - also discussed offline a long time ago: add a new method to all processors - for example named process_chain - which will have another composable signature (a list) and which will be the method called by processors.Sequence. 😊

Please let me know if proposed changes look good, if so, I will add tests. Otherwise, I'm happy to make any changes as necessary. Also, please see the comments as I had some questions

mishig25 avatar May 27 '22 14:05 mishig25

Not sure I understand why we use process_chain ... I think all fn process should take encodings: Vec<Encoding> instead, and Sequence is just a for loop that calls process instead of adding a new endpoint?

thomasw21 avatar May 27 '22 16:05 thomasw21

@Narsil please feel free to do another round of review. The things I've changed based on our huddle on friday are:

  1. process_chain should not use/call process
  2. process_chain should not be optional (i.e. every preprocessor must implement process_chain method)
  3. template processing should err if the input (encodnings vector) length is anything other than 1 or 2
  4. added tests preprocessor (for bytelevel, template, sequence)
  • [ ] add bindings implementation for composable processors

mishig25 avatar May 30 '22 08:05 mishig25

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

we need to remove the code within the process

do you mean, that process implementation should use/call process_chain if so, commit https://github.com/huggingface/tokenizers/pull/1005/commits/78cb806709cd6670e7efa28e2b04c9910ce4aee8 does so

except processors::Sequence::process, process implementations for other processors (Bert, Roberta, ByteLevel, Template) are identical, which is:

fn process(
    &self,
    encoding: Encoding,
    pair_encoding: Option<Encoding>,
    add_special_tokens: bool,
) -> Result<Encoding> {
    let mut encodings = vec![encoding];
    if let Some(encoding) = pair_encoding {
        encodings.push(encoding);
    }

    let encodings = self.process_chain(encodings, add_special_tokens)?;

    <dyn PostProcessor>::merge_encodings(encodings)
}

therefore, made process of PostPorcoessor trait here have a default implementation, wdyt?

mishig25 avatar Jun 01 '22 12:06 mishig25

Hello to all!

Thanks a lot again @mishig25 for taking the lead on this feature! :raised_hands:

I was wondering where we are on this feature! What is currently the sticking point? Is there anything we need to discuss?

SaulLu avatar Jun 27 '22 08:06 SaulLu

Superceded by #1047

mishig25 avatar Aug 24 '22 08:08 mishig25