tokenizers
tokenizers copied to clipboard
Add `postprocessor::Sequence`
Closes https://github.com/huggingface/tokenizers/issues/873 by implementing preprocessor::Sequence
as suggested by @SaulLu,
Let me just re-share here a design idea that we had - I think - also discussed offline a long time ago: add a new method to all processors - for example named process_chain - which will have another composable signature (a list) and which will be the method called by processors.Sequence. 😊
Please let me know if proposed changes look good, if so, I will add tests. Otherwise, I'm happy to make any changes as necessary. Also, please see the comments as I had some questions
Not sure I understand why we use process_chain
... I think all fn process
should take encodings: Vec<Encoding>
instead, and Sequence
is just a for loop that calls process
instead of adding a new endpoint?
@Narsil please feel free to do another round of review. The things I've changed based on our huddle on friday are:
-
process_chain
should not use/callprocess
-
process_chain
should not be optional (i.e. every preprocessor must implementprocess_chain
method) - template processing should err if the input (encodnings vector) length is anything other than 1 or 2
- added tests preprocessor (for bytelevel, template, sequence)
- [ ] add bindings implementation for composable processors
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
we need to remove the code within the process
do you mean, that process
implementation should use/call process_chain
if so, commit https://github.com/huggingface/tokenizers/pull/1005/commits/78cb806709cd6670e7efa28e2b04c9910ce4aee8 does so
except processors::Sequence::process
, process
implementations for other processors (Bert, Roberta, ByteLevel, Template) are identical, which is:
fn process(
&self,
encoding: Encoding,
pair_encoding: Option<Encoding>,
add_special_tokens: bool,
) -> Result<Encoding> {
let mut encodings = vec![encoding];
if let Some(encoding) = pair_encoding {
encodings.push(encoding);
}
let encodings = self.process_chain(encodings, add_special_tokens)?;
<dyn PostProcessor>::merge_encodings(encodings)
}
therefore, made process
of PostPorcoessor
trait here have a default implementation, wdyt?
Hello to all!
Thanks a lot again @mishig25 for taking the lead on this feature! :raised_hands:
I was wondering where we are on this feature! What is currently the sticking point? Is there anything we need to discuss?
Superceded by #1047