Nicolas Patry


> switch to a pure-rust regex like `fancy_regex`, and test all the serialisations against onig

You're welcome to try. But how do you cover **all possible cases** without actually proving...

Sorry I didn't reply earlier, I might have missed the notification.

> I wonder whether there could be some sort of config that is saved along with tokenizer data such...

Hi @sdtblck, Thanks for the reproducible script, very easy to reproduce. It turns out your tokenizer was trained with `add_prefix_space`, which adds a prefix space to words when decoding....
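For illustration, here is a minimal sketch of what `add_prefix_space` changes at the pre-tokenization step (assuming the current `tokenizers` Python API; the printed outputs are indicative):

```python
from tokenizers.pre_tokenizers import ByteLevel

# With add_prefix_space=True, a leading space marker (Ġ) is prepended to the
# first word, so it is encoded like any other word preceded by a space.
print(ByteLevel(add_prefix_space=True).pre_tokenize_str("Hello world"))
# e.g. [('ĠHello', ...), ('Ġworld', ...)]

print(ByteLevel(add_prefix_space=False).pre_tokenize_str("Hello world"))
# e.g. [('Hello', ...), ('Ġworld', ...)]
```

Decoding with the matching `ByteLevel` decoder then turns `Ġ` back into a space, which is where the extra leading space shows up.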

Like most options in this library, they exist to emulate previously published work that behaves in a certain way, and we needed to behave in exactly the same way...

Hi @glample, This is expected because the `pre_tokenizer` splits with this regexp:
https://github.com/huggingface/tokenizers/blob/master/tokenizers/src/pre_tokenizers/byte_level.rs#L35
https://github.com/huggingface/transformers/blob/master/src/transformers/models/gpt2/tokenization_gpt2.py#L193
Which can be traced to the original implementation:
https://github.com/openai/gpt-2/blob/master/src/encoder.py#L53
Because the tokens are pre-split, the BPE...
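For context, a small sketch of that pre-split, using the pattern from the linked `encoder.py` with the `regex` package (the example output is indicative):

```python
import regex  # the `regex` package, needed for \p{...} character classes

# Pattern from https://github.com/openai/gpt-2/blob/master/src/encoder.py#L53
pat = regex.compile(
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)

# The text is chunked *before* BPE runs, so merges never cross these boundaries.
print(pat.findall("Hello world, it's 2021!"))
# ['Hello', ' world', ',', ' it', "'s", ' 2021', '!']
```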

This might lead to other consequences though, so maybe be careful about the `decoder` part. The best way might be to try to use the raw components directly: [Components](https://huggingface.co/docs/tokenizers/python/latest/components.html)...
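As a rough sketch (the component names are from the current Python API, the corpus path is hypothetical), assembling a tokenizer from raw components looks like this:

```python
from tokenizers import Tokenizer, decoders, pre_tokenizers
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

# Build the pipeline piece by piece instead of starting from a template.
tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
# The decoder must mirror the pre-tokenizer, otherwise decoding won't undo
# the byte-level mapping correctly.
tokenizer.decoder = decoders.ByteLevel()

trainer = BpeTrainer(vocab_size=1000)
tokenizer.train(["my_corpus.txt"], trainer=trainer)  # hypothetical file
```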

You are entirely correct that it should be overridable as mentioned in the TODO. To my knowledge this is doable; the `Replace` normalizer does it (https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#tokenizers.normalizers.Replace). Personally I don't think,...
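For reference, a minimal sketch of the `Replace` normalizer mentioned above (the pattern and replacement here are just an example):

```python
from tokenizers import Regex, normalizers

# Replace rewrites anything matching the pattern before tokenization happens.
collapse_spaces = normalizers.Replace(Regex(" {2,}"), " ")
print(collapse_spaces.normalize_str("too   many   spaces"))
# 'too many spaces'
```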

Hi @BlueskyFR, You're more than welcome to update the documentation. `tokenizers` works without really thinking about, or expressing things relative to, existing models like `Bert` or `GPT2`. The main documentation refers...

Hey @mishig25, Thanks for this! Have you actually checked the previous implementation, which uses `par_chunks`? https://github.com/huggingface/tokenizers/pull/921 I didn't do a deep dive into your implementation yet, but it...

Can you try using your script with `TOKENIZERS_PARALLELISM=false` enabled? This should deactivate parallelism within tokenizers and remove the error. If this works, you can write your code single-threaded...
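A minimal sketch of the workaround, either exported in the shell (`TOKENIZERS_PARALLELISM=false python script.py`) or set before the tokenizer is used (the model name below is just an example):

```python
import os

# Disable the Rust-side parallelism; useful when the Python process forks
# afterwards (e.g. DataLoader workers or multiprocessing).
os.environ["TOKENIZERS_PARALLELISM"] = "false"

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("gpt2")  # example model
encodings = [tokenizer.encode(t) for t in ["some text", "more text"]]
```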