Support `wasm`
Wasm support would be a cool issue.
- [ ] Add a feature flag `wasm`.
- [ ] Use `esaxx_rs::suffix_rs` instead of `esaxx_rs::suffix` (maybe with a change in the crate features too, to prevent C compilation).
- [ ] Find a good workaround for `onig` compilation. Dropping support for any element that depends on it is the simplest. Using something like `emscripten` might be done too: https://github.com/rustwasm/team/issues/291 . Big conversation around this: https://github.com/huggingface/tokenizers/issues/63
- [ ] Add a simple example on how to use it when it works.
Hey, I need this feature. Is someone working on it? Otherwise I can take it over if you still want it.
@mbrunel would love some help.
If you want to get started, some discussions have been happening here.
https://github.com/huggingface/tokenizers/issues/63
The main roadblock seems to be the regex engine. @josephrocca found that `fancy-regex` is similar to the C Oniguruma library we use, so we might be able to support it.
From memory, I think the best way forward might be to name the feature `unstable_wasm` and use `fancy-regex`. The goal would be to have something that works, while being very clear to users that the feature might not be 100% compliant with the rest of the lib.
Cheers.
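The `unstable_wasm` idea above could be wired up with Cargo feature gates roughly as follows. This is only a minimal sketch, not the code from the library: the module name `backend`, the constant `NAME`, and the two helper functions are hypothetical, and a real implementation would wrap the `onig` and `fancy-regex` crates instead of returning a label. It assumes a Cargo feature called `unstable_wasm` is declared in `Cargo.toml`.

```rust
// Hypothetical sketch: pick a regex backend at compile time.
// Assumes a Cargo feature named `unstable_wasm` (name taken from the
// discussion); module and function names below are illustrative only.

#[cfg(not(feature = "unstable_wasm"))]
mod backend {
    // Default build: real code would wrap the C Oniguruma bindings here.
    pub const NAME: &str = "onig";
}

#[cfg(feature = "unstable_wasm")]
mod backend {
    // wasm build: real code would wrap the pure-Rust `fancy-regex` crate here.
    pub const NAME: &str = "fancy-regex";
}

/// Name of the regex backend compiled into this build.
pub fn regex_backend() -> &'static str {
    backend::NAME
}

/// Whether this build carries the "might not be 100% compliant" caveat.
pub fn is_unstable_build() -> bool {
    cfg!(feature = "unstable_wasm")
}

fn main() {
    println!("regex backend: {}", regex_backend());
    if is_unstable_build() {
        println!("warning: unstable_wasm build, behaviour may differ from the default build");
    }
}
```

Because both modules expose the same item names, the rest of the crate can call `backend::...` without knowing which engine was compiled in, which keeps the difference between the two builds confined to one place.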
Thanks for the answer. I did this: https://github.com/mithril-security/tokenizers-wasm.git It compiles the whole library to wasm (using josephrocca's method for the regex abstraction) but only exposes very few features. Nonetheless, it has the merits of:
- 1 : working (seemingly)
- 2 : answering my original use-case
Do you think this is the way to go? And if so do you think it would be of interest for people if I (or others) were to do more?
Hi @mbrunel ,
Actually, I started some work in #1009 to fully integrate the work, as this feature seemed to have more traction than I expected (and it was a fun coding thing to do).
The main difference is that I don't think a `pure_rust` feature is nice, since it subtracts from `esaxx-rs`, so I went instead with a default `cpp` feature that I disable for wasm.
Then I went with something similar for the regex abstraction, but I tried to remove the copy and make as few changes as possible.
I didn't try all features within the wasm example project. Your example pulling a tokenizer and running it is nice. Would you be OK if I stole it?
I included `unstable_wasm` as an example, but maybe we could also point to your entire tokenizers-wasm directory if you want to add more features and make it more like a binding to the entire API. The example is here only to provide a starting point right now.
I took the liberty of adding both you and @josephrocca as co-authors, since I clearly stole some ideas.
Feel free to comment on the PR directly if there are any questions or things that I didn't do properly. I very much just followed the wasm tutorial.
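One reading of "subtracts from `esaxx-rs`" is that Cargo features are meant to be additive: a crate is built once with the union of the features requested by every dependent, so a feature that removes code can break another dependent. The toy model below illustrates that rule under this assumption. It does not call Cargo or `esaxx-rs`; the functions `unify`, `compiles_cpp_additive`, and `compiles_cpp_subtractive` are hypothetical names used only for the illustration.

```rust
use std::collections::BTreeSet;

/// Mimics Cargo's feature unification: a crate is built once with the
/// union of the features requested by all of its dependents.
fn unify(requests: &[Vec<&'static str>]) -> BTreeSet<&'static str> {
    requests.iter().flat_map(|r| r.iter().copied()).collect()
}

/// Additive design: the C++ path exists only if someone asks for `cpp`.
fn compiles_cpp_additive(features: &BTreeSet<&'static str>) -> bool {
    features.contains("cpp")
}

/// Subtractive design: the C++ path exists unless someone asks for `pure_rust`.
fn compiles_cpp_subtractive(features: &BTreeSet<&'static str>) -> bool {
    !features.contains("pure_rust")
}

fn main() {
    // Two hypothetical dependents of the same crate in one build.
    // Additive: one asks for `cpp`, the other for nothing; both still work.
    let additive = unify(&[vec!["cpp"], vec![]]);
    println!("additive, C++ compiled: {}", compiles_cpp_additive(&additive));

    // Subtractive: one dependent enabling `pure_rust` removes the C++ path
    // for everyone, including a dependent that relies on it.
    let subtractive = unify(&[vec![], vec!["pure_rust"]]);
    println!("subtractive, C++ compiled: {}", compiles_cpp_subtractive(&subtractive));
}
```

With the additive `cpp` design, a wasm build opts out by not requesting the default features, and enabling a feature never takes code away from another dependent.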
> Actually I started some work in https://github.com/huggingface/tokenizers/pull/1009 to integrate fully the work as this feature seemed to have more traction than I expected (and it was a fun coding thing to do).
Nice
> we could also point to your tokenizers-wasm entire directory if you want to add more features and make it more like a binding to the entire API.
I think this would be best. We're planning to do that, but we also have other priorities. I think I'll continue to work on it from time to time. I'll also make the repo more contribution-friendly so that people can expose features if they need them.
How about, once the PR is merged, we write something to inform the Rust/AI community that the feature now exists and why it can be useful (for security reasons in our case)?
Hi @Narsil, with @mbrunel we wrote an article to explain the motivation for porting Tokenizers to the client side, plus we added insights about how to do it. Could you have a look and give us some feedback? I was wondering if there would be an opportunity to communicate on this together to the Rust and AI communities :)
that's really cool @dhuynh95 @mbrunel!!
Hey @dhuynh95 It's actually pretty cool !! Congrats.
I didn't even realize your use case was running things in an enclave. Thanks for working on re-adding other layers missing from wasm (because we can't have them :)), like `from_pretrained`.