Nicolas Patry

Results 977 comments of Nicolas Patry

Sorry, I can't download that much (36 GB) right now. Could you share your output tokenizer? We could then check my first hypothesis; it will probably be faster.

The tokenizer seems incomplete, but it does contain every Chinese/Japanese character on its own, so my guess is probably correct: you **need** a `ByteLevel` of some kind because your current alphabet is...
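To illustrate why a byte-level alphabet stays small, here is a minimal sketch using only the Python standard library. It does not call the tokenizer library at all; the helper names `char_alphabet_size` and `byte_alphabet_size` are hypothetical and exist only for this illustration. The idea is that a character-level alphabet grows with every new Chinese/Japanese character, while the UTF-8 byte alphabet can never exceed 256 symbols.

```python
def char_alphabet_size(texts):
    """Number of distinct Unicode characters seen across all texts."""
    return len({ch for text in texts for ch in text})


def byte_alphabet_size(texts):
    """Number of distinct UTF-8 byte values seen across all texts (at most 256)."""
    seen = set()
    for text in texts:
        # Iterating over a bytes object yields integers in range(256).
        seen.update(text.encode("utf-8"))
    return len(seen)


# Hypothetical sample data, for illustration only.
sample = ["日本語のテキスト", "中文文本", "hello"]

print("characters:", char_alphabet_size(sample))
print("bytes:     ", byte_alphabet_size(sample))
# Each CJK character here is encoded as three bytes in UTF-8.
print(list("日".encode("utf-8")))
```

On a real multilingual corpus the character count can reach tens of thousands, whereas the byte count is bounded, which is presumably the motivation for a byte-level step here.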

You could try to filter that data manually by looking at the Unicode script of each character: https://stackoverflow.com/questions/9868792/find-out-the-unicode-script-of-a-character (The first and last answers seem viable IMO.) Something along the lines...
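The snippet in the comment is cut off, so the following is an independent, rough sketch of script-based filtering and not the code from the comment or from the linked answers. It assumes that the first word of `unicodedata.name()` is a good enough stand-in for the script (Python's standard library does not expose the Unicode `Script` property directly). The names `rough_script`, `filter_lines` and the `banned` set are hypothetical.

```python
import unicodedata


def rough_script(ch):
    """Approximate a character's script by the first word of its Unicode name.

    This is a heuristic, not the real Unicode Script property.
    """
    try:
        name = unicodedata.name(ch)
    except ValueError:
        # Unnamed code points (e.g. most control characters).
        return "UNKNOWN"
    return name.split(" ")[0]


def filter_lines(lines, banned=frozenset({"CJK", "HIRAGANA", "KATAKANA"})):
    """Keep only the lines that contain no character from a banned script."""
    return [
        line
        for line in lines
        if not any(rough_script(ch) in banned for ch in line)
    ]


# Hypothetical sample data, for illustration only.
lines = ["hello world", "日本語", "mixed 日 line", "ひらがな"]
print(filter_lines(lines))
```

A third-party package with real script data would be more accurate; the name-prefix trick is only meant to show the shape of the filter.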

On the core issue of whether the vocab should be trimmed or not, I am a little torn. I do tend to sympathize with your expectations. If a user wants...

Hi @robvanderg, this is *normal*, so to speak, given how the tokenizer was configured (which we can debate, of course). This tokenizer uses a space splitting which eats up the...

> Thanks for the detailed reply! I understand that it is a tricky case to fix. Would it be safe to assume that any overlapping tuples, where the first is...

Hi @Namco0816, your dataset is probably big enough to overflow `i32` (max `2147483647`). This is unfortunately a known limitation of this library, which doesn't gracefully upgrade to `u64` when such...
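For a sense of what exceeding that range means, here is a small standard-library sketch. It only demonstrates the 32-bit signed limit itself; how the library actually behaves when a count passes it is not shown in the comment, so the wrap-around below is just one possible outcome and not a description of the library. The helpers `fits_in_i32` and `wrap_i32` are hypothetical names.

```python
import ctypes

I32_MAX = 2**31 - 1  # 2147483647, the largest value a signed 32-bit integer holds
I32_MIN = -(2**31)


def fits_in_i32(n):
    """True if n can be stored in a signed 32-bit integer without loss."""
    return I32_MIN <= n <= I32_MAX


def wrap_i32(n):
    """Truncate n to 32 bits the way two's-complement wrap-around would."""
    return ctypes.c_int32(n).value


print(fits_in_i32(I32_MAX))      # True
print(fits_in_i32(I32_MAX + 1))  # False
print(wrap_i32(I32_MAX + 1))     # -2147483648
```

A cheap pre-check such as `fits_in_i32(total_count)` on your side can at least tell you whether a dataset is in the danger zone before training starts.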

Hi @remagpie, thanks for the report! For Python we have a `manylinux2010` build to support old glibc versions and a wider range of Linux systems. Adding such a thing for `node` would...

Hey @kkavyashankar0009, have you tried creating said directory?

Can you share a reproducible example?