`PostProcessor` might have wrong tokens/ids

Open n1t0 opened this issue 5 years ago • 1 comments

Whenever we instantiate a new PostProcessor (as of writing this issue, one of BertProcessing, RobertaProcessing or TemplateProcessing), we specify pairs of token: String and id: u32 for the special tokens that the PostProcessor is going to add. One problem is that nothing checks that the provided tokens/ids make sense to the Tokenizer that will use this PostProcessor.

We might want to find a way to prevent this, by checking somehow that the PostProcessor is configured as expected.

One potential solution might be to add a new method to the trait PostProcessor:

/// Use the provided tokenizer to check that the data we possess is accurate.
/// We can for example use `tokenizer.token_to_id` & `tokenizer.id_to_token`.
fn validate(&self, tokenizer: &Tokenizer) -> Result<()> { Ok(()) }

This method could be called by the Tokenizer when we attach a PostProcessor, making this totally transparent for the user while raising an error in the presence of wrong values.
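
To make the proposal concrete, here is a minimal, self-contained sketch of how such a check might look. It does not use the real crate: `Tokenizer`, `PostProcessor`, `Result`, `special_tokens`, `attach_post_processor` and `BertLikeProcessing` below are all hypothetical stand-ins written for illustration, and the only behaviour assumed is that `token_to_id` and `id_to_token` return an `Option`.

```rust
use std::collections::HashMap;

/// Simplified error type for this sketch (an assumption, not the crate's `Result`).
pub type Result<T> = std::result::Result<T, String>;

/// Hypothetical trait mirroring the proposal: a processor exposes the
/// special tokens it will add and can validate them against a tokenizer.
pub trait PostProcessor {
    /// The (token, id) pairs this processor was configured with.
    fn special_tokens(&self) -> Vec<(String, u32)>;

    /// Check every configured pair against the tokenizer's vocabulary.
    fn validate(&self, tokenizer: &Tokenizer) -> Result<()> {
        for (token, id) in self.special_tokens() {
            match tokenizer.token_to_id(&token) {
                Some(found) if found == id => {}
                Some(found) => {
                    return Err(format!(
                        "token {:?} has id {} in the vocabulary, not {} (id {} maps to {:?})",
                        token, found, id, id, tokenizer.id_to_token(id)
                    ))
                }
                None => return Err(format!("token {:?} is not in the vocabulary", token)),
            }
        }
        Ok(())
    }
}

/// Stand-in tokenizer: only the vocabulary lookups and the attach step are modelled.
pub struct Tokenizer {
    vocab: HashMap<String, u32>,
    post_processor: Option<Box<dyn PostProcessor>>,
}

impl Tokenizer {
    pub fn new(pairs: &[(&str, u32)]) -> Self {
        let vocab = pairs.iter().map(|(t, i)| (t.to_string(), *i)).collect();
        Tokenizer { vocab, post_processor: None }
    }

    pub fn token_to_id(&self, token: &str) -> Option<u32> {
        self.vocab.get(token).copied()
    }

    pub fn id_to_token(&self, id: u32) -> Option<&str> {
        self.vocab.iter().find(|(_, v)| **v == id).map(|(t, _)| t.as_str())
    }

    /// Validate first, then attach: a misconfigured processor is rejected
    /// and never stored on the tokenizer.
    pub fn attach_post_processor(&mut self, processor: Box<dyn PostProcessor>) -> Result<()> {
        processor.validate(self)?;
        self.post_processor = Some(processor);
        Ok(())
    }

    pub fn has_post_processor(&self) -> bool {
        self.post_processor.is_some()
    }
}

/// Hypothetical processor holding two special tokens, loosely modelled on BertProcessing.
pub struct BertLikeProcessing {
    sep: (String, u32),
    cls: (String, u32),
}

impl BertLikeProcessing {
    pub fn new(sep: (&str, u32), cls: (&str, u32)) -> Self {
        BertLikeProcessing {
            sep: (sep.0.to_string(), sep.1),
            cls: (cls.0.to_string(), cls.1),
        }
    }
}

impl PostProcessor for BertLikeProcessing {
    fn special_tokens(&self) -> Vec<(String, u32)> {
        vec![self.sep.clone(), self.cls.clone()]
    }
}
```

With this shape, calling `attach_post_processor` with a processor whose ids disagree with the vocabulary returns an `Err` instead of silently producing wrong encodings later. Whether the real check should compare only `token_to_id`, or also `id_to_token`, is left open here.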

n1t0, Sep 10 '20 19:09

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot], May 16 '24 01:05