Nicolas Patry
@mbrunel would love some help. If you want to get started, some discussions have been happening here: https://github.com/huggingface/tokenizers/issues/63 The main roadblock seems to be the regex engine @josephrocca found that...
Hi @mbrunel, I actually started some work in #1009 to fully integrate the work, as this feature seemed to have more traction than I expected (and it was a...
Hey @dhuynh95 It's actually pretty cool!! Congrats. I didn't even realize your use case was running things in an enclave. Thanks for working on re-adding other layers missing from...
I am not sure what your expectations are. `[NUM]` seems present in both and correct; just the non-ASCII symbols `?` and `'` are converted to spaces, which is...
Hi @vgod-dbx, Thanks for sharing this. The library is built automatically on `manylinux2010` as done here: https://github.com/huggingface/tokenizers/blob/master/.github/workflows/python-release.yml#L18 The build script is here: https://github.com/huggingface/tokenizers/blob/master/bindings/python/build-wheels.sh The only thing I see...
Hi @vgod-dbx, Thank you very much for this investigation! It saved me lots of time for sure. It seems like the merge happened quite a while ago (15 Sep 2021)....
Hi @Yuvaraj91, you can check out all available models (and their associated tokenizers) here: https://huggingface.co/models?language=de&sort=downloads You can filter by language on the left, does that work for you...
It's possible. I made a temporary branch with u64 everywhere instead of u32: https://github.com/huggingface/tokenizers/tree/u64_branch It's a temporary fix; we probably need to think a bit more about how to go...
If "th" and "he" are not tokens either, you're probably going to have a subpar tokenizer.
I didn't mean to close.