Nicolas Patry
Nicolas Patry
Hi, this will require an update of `neon` which is the library we use for node bindings. Unfortunately, `neon` introduces a lot of breaking changes (for the better it seems)...
When we redo the bindings, we can also think about manylinux support: https://github.com/huggingface/tokenizers/issues/972
Could you maybe share some code on how to reproduce the issue starting from an existing tokenizer ? Currently it's hard to understand what's going on.
Nice ! Still a few lints :)
> as well as splitting into words". I think it would be better if all it did was replace bytes and left splitting to another pre_tokenizer step. We're in luck,...
> I don't think .add_tokens() is implemented. https://github.com/huggingface/transformers/blob/cad61b68396a1a387287a8e2e2fef78a25b79383/src/transformers/tokenization_utils_base.py#L952 You are pointing to a base class, so yes it's not implemented. Real class: https://github.com/huggingface/transformers/blob/cad61b68396a1a387287a8e2e2fef78a25b79383/src/transformers/tokenization_utils_fast.py#L264
Do you mind creating an actions where this actually ran to check what actually happened. Just comment out the upload to pypi part, so we can iterate on it.
Arm64 node build can be ignored. If you can just remove it from the automated build that'd be ok :) We need to rewrite the node bindings at some point...
> Does anyone use them ? There are some issues open yes, but as we're lagging behind in node versions it's harder and harder to use them https://github.com/huggingface/tokenizers/issues?q=is%3Aissue+is%3Aopen+node
Let's close in favor of https://github.com/huggingface/tokenizers/pull/1055 if you're OK with it.