Nicolas Patry comments

Results: 977 comments of Nicolas Patry

Support `wasm`

@mbrunel would love some help. If you want to get started, some discussions have been happening here: https://github.com/huggingface/tokenizers/issues/63 The main roadblock seems to be the regex engine @josephrocca found that...

Support `wasm`

Hi @mbrunel, I actually started some work in #1009 to fully integrate the work, as this feature seemed to have more traction than I expected (and it was a...

Support `wasm`

Hey @dhuynh95, it's actually pretty cool! Congrats. I didn't even realize your use case was running things in an enclave. Thanks for working on re-adding other layers missing from...

Problem adding token with a specific replace normalizer

I am not sure what your expectations are. `[NUM]` seems present in both and correct; only the non-ASCII symbols `?` and `'` are converted to spaces, which is...

0.11.5 and 0.11.6 packages not compatible with manylinux2010

Hi @vgod-dbx, thanks for sharing this. The library is built automatically on `manylinux2010`, as done here: https://github.com/huggingface/tokenizers/blob/master/.github/workflows/python-release.yml#L18 The build script is here: https://github.com/huggingface/tokenizers/blob/master/bindings/python/build-wheels.sh The only thing I see...

0.11.5 and 0.11.6 packages not compatible with manylinux2010

Hi @vgod-dbx, thank you very much for this investigation! It saved me lots of time for sure. It seems like the merge happened quite a while ago (15 Sep 2021)....

Pre-trainined German tokenizers for BPE or Subword embeddings?

Hi @Yuvaraj91, you can check out all available models (and their associated tokenizers) here: https://huggingface.co/models?language=de&sort=downloads You can filter by language on the left. Does that work for you...

"the" token is splitted to "t" "h" "e" in large scale corpus

It's possible, I made a temporary branch with u64 everywhere instead of u32: https://github.com/huggingface/tokenizers/tree/u64_branch It's a temporary fix, we probably need to think a bit more about how to go...
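For readers wondering why switching from u32 to u64 could matter here, the sketch below shows how a 32-bit counter wraps around on a very large corpus while a 64-bit one does not. It assumes the problem comes from occurrence counts exceeding `u32::MAX`, which the comment above does not state explicitly; `total_u32` and `total_u64` are hypothetical helpers written for illustration, not functions from the tokenizers library.

```rust
/// Hypothetical helper: sum per-chunk occurrence counts into a 32-bit counter.
/// `wrapping_add` makes the wraparound explicit; with default build settings a
/// plain `+` would panic in debug builds and wrap silently in release builds.
fn total_u32(chunks: &[u32]) -> u32 {
    chunks.iter().fold(0u32, |acc, &c| acc.wrapping_add(c))
}

/// Same sum, accumulated in a 64-bit counter so it cannot wrap at 2^32.
fn total_u64(chunks: &[u32]) -> u64 {
    chunks.iter().map(|&c| u64::from(c)).sum()
}

fn main() {
    // Three chunks of 2^31 occurrences each: the true total is 3 * 2^31,
    // which is larger than u32::MAX (2^32 - 1).
    let chunks = [1u32 << 31; 3];

    // The 32-bit total is reduced modulo 2^32, so a very frequent item
    // can end up looking far less frequent than it really is.
    println!("u32 total: {}", total_u32(&chunks));
    println!("u64 total: {}", total_u64(&chunks));
}
```

Under that assumption, a frequent pair such as "t"+"h" could be undercounted and never selected as a merge, which would be consistent with "the" staying split into characters.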

"the" token is splitted to "t" "h" "e" in large scale corpus

If "th" and "he" are not tokens either, you're probably going to have a subpar tokenizer.

"the" token is splitted to "t" "h" "e" in large scale corpus

I didn't mean to close.