Tokenizers for Node 16?

Open etiennelunetta opened this issue 2 years ago • 7 comments

I'm currently using Node 16 and I'm not able to use the latest npm release. Is it possible to rebuild for v15+?

etiennelunetta avatar Feb 17 '22 09:02 etiennelunetta

Hi, this will require an update of neon, which is the library we use for the Node bindings.

Unfortunately, neon has introduced a lot of breaking changes (for the better, it seems) since we made the bindings, so rewriting them is going to be a relatively big endeavor.

Any help in that department would be appreciated.

Until then, we unfortunately can't build for Node 16.

Narsil avatar Feb 17 '22 10:02 Narsil

When we redo the bindings, we can also think about manylinux support: https://github.com/huggingface/tokenizers/issues/972

Narsil avatar Apr 04 '22 10:04 Narsil

It looks like v14 is supported, at least according to https://github.com/huggingface/tokenizers/issues/648, but the package.json file is restricting it to versions less than 14:

https://github.com/huggingface/tokenizers/blob/96a9e5715c5e71ddc26f36fc456c95d729b23923/bindings/node/package.json#L38-L40
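
For anyone who wants to see what that kind of restriction amounts to, here is a minimal sketch of an upper-bound check on the Node major version. The names `nodeMajor`, `isSupported`, and `MAX_MAJOR_EXCLUSIVE` are hypothetical, and the bound of 14 is taken from the comment above rather than from the linked package.json, so treat it as an assumption.

```typescript
// Hypothetical helper: extract the major version from "16.14.0" or "v16.14.0".
function nodeMajor(version: string): number {
  const match = /^v?(\d+)/.exec(version);
  if (!match) {
    throw new Error(`Unrecognized Node version: ${version}`);
  }
  return Number(match[1]);
}

// Assumption: "versions less than 14", as described in the comment above.
const MAX_MAJOR_EXCLUSIVE = 14;

// True when the given Node version falls below the exclusive upper bound.
function isSupported(
  version: string,
  maxExclusive: number = MAX_MAJOR_EXCLUSIVE
): boolean {
  return nodeMajor(version) < maxExclusive;
}
```

In a real project you would pass `process.versions.node` to `isSupported`. Under this assumed bound, both Node 14 and Node 16 are rejected.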

markhughes avatar Apr 03 '23 23:04 markhughes

But I can confirm it won't build for Node v16 in the current state.

There are a lot of breaking changes: https://github.com/neon-bindings/neon/blob/0.10.0/MIGRATION_GUIDE_0.10.md

Would it be helpful if I wrote a script to highlight the changes so we can all start tackling them? 💪

markhughes avatar Apr 03 '23 23:04 markhughes

Hi @markhughes

I think, given the age of neon, it might even be more practical to start over from scratch. Maybe take the current version as a starting point, but that's all.

Help on that front would be highly appreciated!

Don't hesitate to share work early if you want to tackle it!

Narsil avatar Apr 04 '23 06:04 Narsil

For others blocked on this, this analysis finds that @dqbd/tiktoken – a WASM build of OpenAI's official Rust tiktoken – works well.

rattrayalex avatar Apr 24 '23 00:04 rattrayalex

@rattrayalex I'm also stuck on this. I'm using SvelteKit and wish I could use Tokenizers, but the supported Node versions won't allow it. I tried installing things under different Node versions, but I guess that doesn't really work within a single project. I'm still a beginner, so I would appreciate help: I've got a tokenizer.json and am about to start writing my own function to convert the words into numbers (which, as far as I can tell, is what the tokenizer does when I run it in Python). Is there a way to use @dqbd/tiktoken instead of Tokenizers but still use my tokenizer.json? My model is based on T5, so maybe I don't even need to do this?
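
To make the "words into numbers" step concrete, here is a rough sketch of a Unigram-style encoder, which is the model family T5 tokenizers use. It is not the Tokenizers library's implementation. It assumes the vocabulary is already in memory as `[token, logProb]` pairs whose array index is the token id, and it skips normalization and pre-tokenization (for example, replacing spaces with `▁`). The names `VocabEntry`, `encodeUnigram`, and `UNK_PENALTY` are made up for illustration.

```typescript
// Assumed vocabulary shape: [token, log-probability]; the array index is the id.
type VocabEntry = [string, number];

// Assumption: a large fixed penalty for characters missing from the vocab.
const UNK_PENALTY = -100;

// Viterbi search for the highest-scoring segmentation of `text` into vocab pieces.
function encodeUnigram(text: string, vocab: VocabEntry[], unkId: number): number[] {
  const index = new Map<string, { id: number; score: number }>();
  vocab.forEach(([token, score], id) => index.set(token, { id, score }));
  const maxLen = Math.max(1, ...vocab.map(([token]) => Array.from(token).length));

  const chars = Array.from(text); // iterate by code point, not UTF-16 unit
  const n = chars.length;
  const best: number[] = new Array(n + 1).fill(-Infinity); // best score ending at each position
  const backPos: number[] = new Array(n + 1).fill(-1);     // start of the last piece
  const backId: number[] = new Array(n + 1).fill(unkId);   // id of the last piece
  best[0] = 0;

  for (let end = 1; end <= n; end++) {
    for (let start = Math.max(0, end - maxLen); start < end; start++) {
      if (best[start] === -Infinity) continue;
      const piece = chars.slice(start, end).join("");
      const hit = index.get(piece);
      let score: number;
      let id: number;
      if (hit) {
        score = hit.score;
        id = hit.id;
      } else if (end - start === 1) {
        // Unknown single character: fall back to the unknown token.
        score = UNK_PENALTY;
        id = unkId;
      } else {
        continue;
      }
      if (best[start] + score > best[end]) {
        best[end] = best[start] + score;
        backPos[end] = start;
        backId[end] = id;
      }
    }
  }

  // Walk the back-pointers from the end of the text to recover the ids.
  const ids: number[] = [];
  for (let pos = n; pos > 0; pos = backPos[pos]) {
    ids.push(backId[pos]);
  }
  return ids.reverse();
}
```

Whether the ids match what the Python tokenizer produces depends on applying the same normalization and pre-tokenization first, so compare against Python output before relying on anything like this.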

MatousAc avatar May 03 '23 18:05 MatousAc

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Feb 27 '24 01:02 github-actions[bot]