`PostProcessor` might have wrong tokens/ids
Whenever we instantiate a new PostProcessor (as of writing this issue, one of BertProcessing, RobertaProcessing or TemplateProcessing), we specify pairs of token: String and id: u32 for the special tokens that said PostProcessor is going to add. One problem is that nothing checks that the provided tokens/ids make sense to the Tokenizer that will use this PostProcessor.
We might want to find a way to prevent this, by checking somehow that the PostProcessor is configured as expected.
One potential solution might be to add a new method to the trait PostProcessor:
/// Use the provided tokenizer to check that the data we possess is accurate.
/// We can for example use `tokenizer.token_to_id` & `tokenizer.id_to_token`.
fn validate(&self, tokenizer: &Tokenizer) -> Result<()> { Ok(()) }
This method could be called by the Tokenizer when we attach a PostProcessor, making this totally transparent for the user while raising an error in the presence of wrong values.
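To make the idea concrete, here is a minimal, self-contained sketch of what such a check could look like. It does not use the real crate: `ToyTokenizer` and `ToyProcessing` are hypothetical stand-ins, the `token_to_id` here only mirrors the shape of the lookup mentioned above, and the error type is a plain `String` rather than the library's `Result`.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the real `Tokenizer`; only the lookup matters here.
pub struct ToyTokenizer {
    vocab: HashMap<String, u32>,
}

impl ToyTokenizer {
    pub fn new(entries: &[(&str, u32)]) -> Self {
        let vocab = entries
            .iter()
            .map(|(token, id)| (token.to_string(), *id))
            .collect();
        ToyTokenizer { vocab }
    }

    /// Returns the id the tokenizer assigns to `token`, if it knows it.
    pub fn token_to_id(&self, token: &str) -> Option<u32> {
        self.vocab.get(token).copied()
    }
}

/// Hypothetical post-processor that stores (token, id) pairs for the
/// special tokens it will add.
pub struct ToyProcessing {
    special_tokens: Vec<(String, u32)>,
}

impl ToyProcessing {
    pub fn new(pairs: &[(&str, u32)]) -> Self {
        let special_tokens = pairs
            .iter()
            .map(|(token, id)| (token.to_string(), *id))
            .collect();
        ToyProcessing { special_tokens }
    }

    /// Checks every stored pair against the tokenizer's vocabulary and
    /// reports the first mismatch.
    pub fn validate(&self, tokenizer: &ToyTokenizer) -> Result<(), String> {
        for (token, id) in &self.special_tokens {
            match tokenizer.token_to_id(token) {
                Some(actual) if actual == *id => {}
                Some(actual) => {
                    return Err(format!(
                        "token {:?} has id {} in the tokenizer, but the post-processor was given {}",
                        token, actual, id
                    ))
                }
                None => return Err(format!("token {:?} is unknown to the tokenizer", token)),
            }
        }
        Ok(())
    }
}
```

With a tokenizer that maps `[CLS]` to 101, a processor built with `("[CLS]", 101)` would pass, while one built with `("[CLS]", 0)` or with a token the tokenizer does not know would return an error. Whether the real check should fail hard or only warn is left open here.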