Tokenizers for Node 16?
I'm currently using Node 16 and I'm not able to use the latest npm release. Is it possible to rebuild for v15+?
Hi, this will require an update of neon, which is the library we use for the Node bindings.
Unfortunately, neon has introduced a lot of breaking changes (for the better, it seems) since we made the bindings, so this is going to be a relatively big endeavor to rewrite.
Any help in that department will be appreciated.
Until then, we unfortunately can't build for Node 16.
When we redo the bindings, we can also think about manylinux support: https://github.com/huggingface/tokenizers/issues/972
It looks like v14 is supported, at least according to https://github.com/huggingface/tokenizers/issues/648, but the package.json file restricts it to versions below 14:
https://github.com/huggingface/tokenizers/blob/96a9e5715c5e71ddc26f36fc456c95d729b23923/bindings/node/package.json#L38-L40
But I can confirm it won't build for Node v16 in the current state.
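For context, the restriction lives in the engines field of package.json; npm warns (and, with engine-strict set, refuses to install) when the running Node version falls outside the declared range. A sketch of what that field looks like; the exact range in the linked file may differ:

```json
{
  "engines": {
    "node": ">= 10 < 14"
  }
}
```

Widening the range by itself would not fix things, though: the native module still fails to compile against Node 16 until the neon rewrite lands.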
There are a lot of breaking changes: https://github.com/neon-bindings/neon/blob/0.10.0/MIGRATION_GUIDE_0.10.md
Would it be helpful if I wrote a script to highlight the changes (something like the sketch below) so we can all start tackling them? 💪
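A minimal sketch of such a script, assuming the legacy pre-0.10 neon macros (declare_types!, register_module!) and the removed cx.borrow API are reasonable markers for code that needs migrating, and that the bindings live under bindings/node/native/src as in the current repo layout:

```ts
import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

// Patterns that only appear in legacy (pre-0.10) neon code.
const LEGACY_PATTERNS = [/declare_types!/, /register_module!/, /cx\.borrow\b/];

// Recursively yield every .rs file under a directory.
function* rustFiles(dir: string): Generator<string> {
  for (const entry of readdirSync(dir, { withFileTypes: true })) {
    const path = join(dir, entry.name);
    if (entry.isDirectory()) yield* rustFiles(path);
    else if (entry.name.endsWith(".rs")) yield path;
  }
}

// Print file:line for every legacy-API hit so the migration work can be split up.
for (const file of rustFiles("bindings/node/native/src")) {
  readFileSync(file, "utf8")
    .split("\n")
    .forEach((line, i) => {
      if (LEGACY_PATTERNS.some((p) => p.test(line))) {
        console.log(`${file}:${i + 1}: ${line.trim()}`);
      }
    });
}
```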
Hi @markhughes
I think, given the age of neon, it might even be more practical to start over from scratch. Maybe take the current version as a starting point, but that's all.
Help on that front would be highly appreciated!
Don't hesitate to share work early if you want to tackle it!
For others blocked on this, this analysis finds that @dqbd/tiktoken (a WASM build of OpenAI's official Rust tiktoken) works well.
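A minimal usage sketch, assuming the package's documented API (get_encoding returns an encoder backed by WASM memory, which must be released with free()):

```ts
import { get_encoding } from "@dqbd/tiktoken";

// Load a BPE encoding bundled with tiktoken; cl100k_base is the one
// used by the gpt-3.5-turbo / gpt-4 family of models.
const enc = get_encoding("cl100k_base");

const tokens = enc.encode("Hello, world!");
console.log(tokens.length, tokens); // token count and ids

// The encoder holds WASM memory, so free it when you are done.
enc.free();
```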
@rattrayalex I'm also stuck on this. I'm using SvelteKit and wish I could use Tokenizers, but node versions won't allow it. I tried installing stuff with different node versions, but I guess that doesn't really work in a single project.
I'm still a beginner, so I would appreciate help: I've got a tokenizer.json and am about to start writing my own function to convert the words into numbers (which is what the tokenizer looks to be doing when I run it in Python). Is there a way to use @dqbd/tiktoken instead of Tokenizers but still use my tokenizer.json? My model is based on T5, so maybe I don't even need to do this?
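For illustration only, here is a naive word-to-id lookup against the vocab stored in a tokenizer.json. This is an assumption-heavy toy, not a replacement for the real library: actual tokenizers also apply normalization, subword segmentation, and special tokens, so it will not reproduce the Python output for a T5 (SentencePiece/Unigram) model, and the vocab layout assumed below is the BPE-style {token: id} map rather than Unigram's [token, score] pairs:

```ts
import { readFileSync } from "node:fs";

// Toy sketch: whole-word lookups in the vocab of a tokenizer.json.
// Real tokenizers normalize text and split words into subwords; this does neither.
const tokenizer = JSON.parse(readFileSync("tokenizer.json", "utf8"));

// Assumes a BPE-style vocab map under model.vocab; Unigram models
// (like T5's) store an array of [token, score] pairs instead.
const vocab: Record<string, number> = tokenizer.model.vocab;

const unkId = vocab["<unk>"] ?? 0; // assumed name of the unknown token

function naiveEncode(text: string): number[] {
  return text.split(/\s+/).map((word) => vocab[word] ?? unkId);
}

console.log(naiveEncode("hello world"));
```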
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.