Nicolas Patry

Results 978 comments of Nicolas Patry

No, we just need a core maintainer's approval. Sorry, I forgot about this PR. @sgugger for final review.

Oh sorry! Missed this one, it's OK!

True, I uncovered more issues around multiple-space handling; I'm nailing down the pre_tokenizer combo for it.

More troublesome than anticipated. When encoding `" Hello"` from a pure BPE perspective, `tokenizers` does `[259, 10994]` (`" "` + `Hello`) whereas spm does `[29871, 15043]` (`" "` + `"...

For the doc builder, we're going to need an update on the docker image so that it pulls 0.13.3 to generate the doc.

> Hi @Narsil,
>
> the `warning.warn` to `raise RuntimeError` change in `src/transformers/convert_slow_tokenizer.py` breaks a lot of things: I wanted to fine-tune a mT5 model and it is now...

Both are using Unigram with ByteFallback which isn't supported yet.

Which repo are you using? We need to create the fast files on the repo. Converting from slow is super slow and there's nothing to be done about it (tokenizers...

Do you have the faulty sample too? I cannot reproduce with a dummy file. @ArthurZucker it does look like the last token is indeed not a timestamp, but...