Nicolas Patry comments

Results: 978 comments of Nicolas Patry

fix: Text splitting in the BasicTokenizer

No, we just need a core maintainer's approval. Sorry, I forgot about this PR. @sgugger for final review.

fix: Text splitting in the BasicTokenizer

Oh sorry! Missed this one, it's OK!

[WIP] Adding Llama FastTokenizer support.

True, I uncovered more issues around multiple-space handling; I'm nailing down the pre_tokenizer combo for it.

[WIP] Adding Llama FastTokenizer support.

More troublesome than anticipated. When encoding `" Hello"` from a pure BPE perspective, `tokenizers` does `[259, 10994]` (`" "` + `Hello`), whereas spm does `[29871, 15043]` (`" "` + `"...
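A mismatch like this is the kind of thing a small parity check between the slow and fast encoders would surface. The sketch below is illustrative only: `slow_encode` and `fast_encode` are hypothetical stand-ins (not `transformers` or `tokenizers` APIs) that simply return the ids quoted above for `" Hello"`, and `find_mismatches` is an assumed helper name.

```python
from typing import Callable, Iterable, List, Tuple

Encoder = Callable[[str], List[int]]


def find_mismatches(
    slow_encode: Encoder, fast_encode: Encoder, texts: Iterable[str]
) -> List[Tuple[str, List[int], List[int]]]:
    """Return (text, slow_ids, fast_ids) for every text the two encoders disagree on."""
    mismatches = []
    for text in texts:
        slow_ids = slow_encode(text)
        fast_ids = fast_encode(text)
        if slow_ids != fast_ids:
            mismatches.append((text, slow_ids, fast_ids))
    return mismatches


# Hypothetical stand-ins: they hard-code the ids quoted in the comment
# instead of calling a real tokenizer.
def slow_encode(text: str) -> List[int]:
    return {" Hello": [29871, 15043]}.get(text, [])


def fast_encode(text: str) -> List[int]:
    return {" Hello": [259, 10994]}.get(text, [])


mismatches = find_mismatches(slow_encode, fast_encode, [" Hello"])
for text, slow_ids, fast_ids in mismatches:
    print(f"{text!r}: slow={slow_ids} fast={fast_ids}")
```

In practice the two callables would wrap the real slow and fast tokenizers, and the text list would include leading, trailing, and repeated spaces.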

[WIP] Adding Llama FastTokenizer support.

For the doc builder, we're going to need an update on the docker image so that it pulls 0.13.3 to generate the doc.

[WIP] Adding Llama FastTokenizer support.

> Hi @Narsil, the `warnings.warn` to `raise RuntimeError` change in `src/transformers/convert_slow_tokenizer.py` breaks a lot of things: I wanted to fine-tune an mT5 model and it is now...
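To illustrate why switching from a warning to `raise RuntimeError` can break existing callers, here is a minimal sketch. The two functions and the `unsupported_feature` flag are hypothetical and are not the code in `convert_slow_tokenizer.py`; they only contrast the two behaviours.

```python
import warnings


def convert_with_warning(unsupported_feature: bool) -> str:
    # Old-style behaviour (assumed): complain, but still return a result.
    if unsupported_feature:
        warnings.warn("feature is not supported by the fast tokenizer")
    return "converted"


def convert_with_error(unsupported_feature: bool) -> str:
    # New-style behaviour (assumed): refuse to continue.
    if unsupported_feature:
        raise RuntimeError("feature is not supported by the fast tokenizer")
    return "converted"


# With a warning, the caller still gets a value back.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = convert_with_warning(True)

# With an error, the same call path now has to be wrapped or it aborts.
try:
    convert_with_error(True)
    failed = False
except RuntimeError:
    failed = True

print(result, len(caught), failed)
```

Code that previously ran to completion despite the warning stops at the exception unless it is updated to catch it or to avoid the unsupported path.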

[WIP] Adding Llama FastTokenizer support.

Both are using Unigram with ByteFallback which isn't supported yet.

[WIP] Adding Llama FastTokenizer support.

Which repo are you using? We need to create the fast files on the repo. Converting from slow is super slow and there's nothing to be done about it (tokenizers...

[WIP] Adding Llama FastTokenizer support.

@ArthurZucker

WhisperTimeStampLogitsProcessor error while using Whisper pipelines. Was WhisperTimeStampLogitsProcessor used?

Do you have the faulty sample too? I cannot reproduce with a dummy file. @ArthurZucker it does look like the last token is indeed not a timestamp, but...