Nicolas Patry


> switch to a pure-rust regex like `fancy_regex`, and test all the serialisations against onig

You're welcome to try. But how do you cover **all possible cases** without actually proving...

Sorry I didn't reply earlier, I might have missed the notification.

> I wonder whether there could be some sort of config that is saved along with tokenizer data such...

Hi @sdtblck, Thanks for the reproducible script, very easy to reproduce. It turns out your tokenizer was trained with `add_prefix_space`, which adds a prefix space to words when decoding....
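For illustration, here is a minimal sketch of what `add_prefix_space` changes at the pre-tokenization step (assuming the current `tokenizers` Python API; the printed outputs are indicative):

```python
from tokenizers.pre_tokenizers import ByteLevel

# With add_prefix_space=True, a leading space marker (Ġ) is prepended to the
# first word, so it is encoded like any other word preceded by a space.
print(ByteLevel(add_prefix_space=True).pre_tokenize_str("Hello world"))
# e.g. [('ĠHello', ...), ('Ġworld', ...)]

print(ByteLevel(add_prefix_space=False).pre_tokenize_str("Hello world"))
# e.g. [('Hello', ...), ('Ġworld', ...)]
```

Decoding with the matching `ByteLevel` decoder then turns `Ġ` back into a space, which is where the extra leading space shows up.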

Like most options in this library, they exist to emulate previously published work that behaves in a certain way, and we needed to behave in exactly the same way...

Hi @glample, This is expected because the `pre_tokenizer` splits with this regexp:
https://github.com/huggingface/tokenizers/blob/master/tokenizers/src/pre_tokenizers/byte_level.rs#L35
https://github.com/huggingface/transformers/blob/master/src/transformers/models/gpt2/tokenization_gpt2.py#L193
Which can be traced to the original implementation:
https://github.com/openai/gpt-2/blob/master/src/encoder.py#L53
Because the tokens are pre-split, the BPE...
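For context, a small sketch of that pre-split, using the pattern from the linked `encoder.py` with the `regex` package (the example output is indicative):

```python
import regex  # the `regex` package, needed for \p{...} character classes

# Pattern from https://github.com/openai/gpt-2/blob/master/src/encoder.py#L53
pat = regex.compile(
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)

# The text is chunked *before* BPE runs, so merges never cross these boundaries.
print(pat.findall("Hello world, it's 2021!"))
# ['Hello', ' world', ',', ' it', "'s", ' 2021', '!']
```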

This might lead to other consequences though, so maybe be careful about the `decoder` part. The best way might be to try to use the raw components directly: [Components](https://huggingface.co/docs/tokenizers/python/latest/components.html)...
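As a rough sketch (the component names are from the current Python API, the corpus path is hypothetical), assembling a tokenizer from raw components looks like this:

```python
from tokenizers import Tokenizer, decoders, pre_tokenizers
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

# Build the pipeline piece by piece instead of starting from a template.
tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
# The decoder must mirror the pre-tokenizer, otherwise decoding won't undo
# the byte-level mapping correctly.
tokenizer.decoder = decoders.ByteLevel()

trainer = BpeTrainer(vocab_size=1000)
tokenizer.train(["my_corpus.txt"], trainer=trainer)  # hypothetical file
```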

You are entirely correct that it should be overridable as mentioned in the TODO. To my knowledge this is doable; the `Replace` normalizer does it (https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#tokenizers.normalizers.Replace). Personally I don't think,...
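For reference, a minimal sketch of the `Replace` normalizer mentioned above (the pattern and replacement here are just an example):

```python
from tokenizers import Regex, normalizers

# Replace rewrites anything matching the pattern before tokenization happens.
collapse_spaces = normalizers.Replace(Regex(" {2,}"), " ")
print(collapse_spaces.normalize_str("too   many   spaces"))
# 'too many spaces'
```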

Hi @BlueskyFR, You're more than welcome to update the documentation. `tokenizers` works without really thinking about, or expressing things relative to, existing models like `Bert` or `GPT2`. The main documentation refers...

Hey @mishig25, Thanks for this! Have you actually checked the previous implementation, which uses `par_chunks`? https://github.com/huggingface/tokenizers/pull/921 I didn't do a deep dive into your implementation yet, but it...

Can you try using your script with `TOKENIZERS_PARALLELISM=false` enabled? This should deactivate parallelism within tokenizers and remove the error. If this works, you can write your code single-threaded...
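A minimal sketch of the workaround, either exported in the shell (`TOKENIZERS_PARALLELISM=false python script.py`) or set before the tokenizer is used (the model name below is just an example):

```python
import os

# Disable the Rust-side parallelism; useful when the Python process forks
# afterwards (e.g. DataLoader workers or multiprocessing).
os.environ["TOKENIZERS_PARALLELISM"] = "false"

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("gpt2")  # example model
encodings = [tokenizer.encode(t) for t in ["some text", "more text"]]
```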