Nicolas Patry comments

Results: 977 comments of Nicolas Patry

vocab_size issue with Whitespace pre_tokenizer

Sorry, I can't download that much (36 GB) right now. Could you share your output tokenizer? We could then check my first hypothesis; it will probably be faster.

vocab_size issue with Whitespace pre_tokenizer

The tokenizer seems incomplete, but it does contain every Chinese/Japanese character on its own, so my guess is probably correct: you **need** a `ByteLevel` of some kind because your current alphabet is...
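To illustrate why a byte-level alphabet stays small, here is a minimal standard-library sketch. It assumes the concern is alphabet size (the comment above is cut off before it says so), and `alphabet_sizes` is a hypothetical helper, not part of the `tokenizers` library. It only compares how many distinct symbols a text needs at character level versus UTF-8 byte level.

```python
from typing import Tuple


def alphabet_sizes(text: str) -> Tuple[int, int]:
    """Return (distinct characters, distinct UTF-8 byte values) for `text`.

    Hypothetical helper for illustration only.
    """
    distinct_chars = set(text)
    # Iterating over a bytes object yields integers in the range 0..255.
    distinct_bytes = set(text.encode("utf-8"))
    return len(distinct_chars), len(distinct_bytes)


# 1000 consecutive code points from the CJK Unified Ideographs block.
sample = "".join(chr(cp) for cp in range(0x4E00, 0x4E00 + 1000))
n_chars, n_bytes = alphabet_sizes(sample)
print(n_chars, n_bytes)
```

The character-level alphabet grows with every new ideograph, while the byte-level alphabet can never exceed 256 symbols, which leaves the rest of the vocabulary budget for learned merges.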

vocab_size issue with Whitespace pre_tokenizer

You could try to filter that data manually by looking at the Unicode script of each character. https://stackoverflow.com/questions/9868792/find-out-the-unicode-script-of-a-character (The first and last answers seem viable IMO.) Something along the lines...
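The rest of that comment is cut off here, so the following is only one possible sketch of such a filter, not the code from the original comment. Python's standard library does not expose the Unicode script property directly, so this version approximates it from the character name returned by `unicodedata.name`. The prefix list and the helper names `is_cjk` and `strip_cjk` are assumptions made for illustration.

```python
import unicodedata

# Assumed set of name prefixes treated as Chinese/Japanese for this sketch.
CJK_NAME_PREFIXES = ("CJK", "HIRAGANA", "KATAKANA")


def is_cjk(ch: str) -> bool:
    """Return True if the character's Unicode name starts with a CJK-like prefix."""
    # Unnamed characters (e.g. control characters) fall back to "" and are kept.
    name = unicodedata.name(ch, "")
    return name.startswith(CJK_NAME_PREFIXES)


def strip_cjk(text: str) -> str:
    """Drop every character that `is_cjk` flags, keeping everything else."""
    return "".join(ch for ch in text if not is_cjk(ch))


print(strip_cjk("hello 中文 world"))
```

Whether to drop individual characters, as here, or whole lines containing them depends on the dataset; the same `is_cjk` predicate works for either.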

vocab_size issue with Whitespace pre_tokenizer

On the core issue of whether the vocab should be trimmed or not, I am a little torn. I do tend to sympathize with your expectations. If a user wants...

XLM-Roberta offset mapping is off by one in case of whitespace-subwords

Hi @robvanderg, this is *normal*, so to speak, given how the tokenizer was configured (which we can debate, ofc). This tokenizer uses space splitting, which eats up the...

XLM-Roberta offset mapping is off by one in case of whitespace-subwords

> Thanks for the detailed reply! I understand that it is a tricky case to fix. Would it be safe to assume that any overlapping tuples, where the first is...

PanicException For Result::unwrap()

Hi @Namco0816, your dataset is probably big enough to overflow `i32` (max `2147483647`). This is unfortunately a known limitation of this library, which doesn't gracefully upgrade to `u64` when such...
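For reference, the `i32` bound quoted above can be checked with a few lines of Python. This is only an arithmetic sketch: the comment is cut off before it says which quantity overflows, and `fits_in_i32` is a hypothetical helper rather than anything the library provides.

```python
I32_MIN = -(2**31)
I32_MAX = 2**31 - 1  # 2147483647, the value quoted in the comment


def fits_in_i32(n: int) -> bool:
    """Return True if `n` is representable as a signed 32-bit integer."""
    return I32_MIN <= n <= I32_MAX


print(fits_in_i32(I32_MAX), fits_in_i32(I32_MAX + 1))
```

A count that might exceed this bound can be tested this way before it is handed to code that stores it in 32 bits.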

"GLIBC_2.29 not found" on nodejs binding

Hi @remagpie, thanks for the report! For Python we have a `manylinux2010` build to support old glibc versions and broader Linux support. Adding such a thing for `node` would...

tokenizer.save_vocabulary()

Hey @kkavyashankar0009, have you tried creating said directory?

tokenizer.save_vocabulary()

Can you give a reproducible example?