Support `wasm`

Open Narsil opened this issue 3 years ago • 9 comments

Wasm support would be a cool issue.

  • [ ] Add a feature flag wasm.
  • [ ] Use esaxx_rs::suffix_rs instead of esaxx_rs::suffix (Maybe with a change in the crate features too to prevent C compilation)
  • [ ] Find a good workaround for onig compilation. Dropping support for any element that depends on it is the simplest option; using something like emscripten might work too: https://github.com/rustwasm/team/issues/291 . Big conversation around this: https://github.com/huggingface/tokenizers/issues/63
  • [ ] Add a simple example on how to use when it works.
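As a rough illustration of the first two checklist items, here is a minimal sketch of compile-time backend selection behind a `wasm` feature flag. The function names `suffix_array` and `suffix_array_pure_rust` are hypothetical stand-ins, not the esaxx_rs API, and the native branch only delegates to the pure-Rust routine so the sketch compiles without a C++ toolchain.

```rust
/// Naive pure-Rust suffix array: sorts suffix start offsets by suffix content.
/// Hypothetical stand-in for the role `esaxx_rs::suffix_rs` would play.
fn suffix_array_pure_rust(text: &str) -> Vec<usize> {
    let bytes = text.as_bytes();
    let mut starts: Vec<usize> = (0..bytes.len()).collect();
    starts.sort_by(|&a, &b| bytes[a..].cmp(&bytes[b..]));
    starts
}

/// Native build (feature `wasm` off): a real crate would call the
/// C++-backed routine here. Assumption: this sketch just reuses the
/// pure-Rust version so it stays self-contained.
#[cfg(not(feature = "wasm"))]
fn suffix_array(text: &str) -> Vec<usize> {
    suffix_array_pure_rust(text)
}

/// Wasm build (feature `wasm` on): only pure-Rust code is compiled,
/// so no C/C++ compilation step is needed.
#[cfg(feature = "wasm")]
fn suffix_array(text: &str) -> Vec<usize> {
    suffix_array_pure_rust(text)
}
```

Callers only ever see `suffix_array`, so the choice of backend stays a build-time detail: `suffix_array("banana")` gives the same offsets with or without the feature enabled.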

Narsil avatar Mar 01 '22 11:03 Narsil

Hey, I need this feature. Is someone working on it? Otherwise I can take it over if you still want it.

mbrunel avatar May 30 '22 12:05 mbrunel

@mbrunel would love some help.

If you want to get started, some discussions have been happening here.

https://github.com/huggingface/tokenizers/issues/63

The main roadblock seems to be the regex engine. @josephrocca found that fancy-regex is similar to the C Oniguruma library we use, so we might be able to support it.

From memory, I think the best way forward might be to name the feature unstable_wasm and use fancy-regex. The goal would be to have something that works while being very clear to users that the feature might not be 100% compliant with the rest of the lib.

Cheers.

Narsil avatar Jun 01 '22 16:06 Narsil

Thanks for the answer. I did this: https://github.com/mithril-security/tokenizers-wasm.git It compiles the whole library to wasm (using josephrocca's method for the regex abstraction) but exposes very few features. Nonetheless, it has the merits of:

  • 1 : working (seemingly)
  • 2 : answering my original use-case

Do you think this is the way to go? And if so do you think it would be of interest for people if I (or others) were to do more?

mbrunel avatar Jun 04 '22 03:06 mbrunel

Hi @mbrunel ,

Actually I started some work in #1009 to fully integrate the work, as this feature seemed to have more traction than I expected (and it was a fun coding thing to do).

The main difference is that I don't think a pure_rust feature is nice, since it subtracts from esaxx-rs, so I went instead with a default cpp feature that I disable for wasm.

Then I went with something similar for the regex abstraction, but I tried to remove the copy and change as little as possible. I didn't try all features within the wasm example project. Your example pulling a tokenizer and running it is nice, would you be OK if I stole it?
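For readers unfamiliar with the idea, a regex abstraction of this kind can be sketched roughly as below. This is not the code from the PR: the trait `PatternEngine`, the `LiteralEngine` backend, and `split_on_matches` are all hypothetical names, and the backend does plain substring search so the example runs without onig or fancy-regex. In a real build, one implementation would wrap the native engine and another the pure-Rust one.

```rust
/// The minimal interface the calling code needs from a regex engine.
/// Hypothetical trait for illustration only.
trait PatternEngine {
    /// Byte ranges (start, end) of all non-overlapping matches in `text`.
    fn find_matches(&self, text: &str) -> Vec<(usize, usize)>;
}

/// Stand-in backend that matches a literal substring.
/// Assumption: a real backend would wrap an actual regex engine.
struct LiteralEngine {
    needle: String,
}

impl PatternEngine for LiteralEngine {
    fn find_matches(&self, text: &str) -> Vec<(usize, usize)> {
        // An empty needle would match at every position; treat it as no match.
        if self.needle.is_empty() {
            return Vec::new();
        }
        text.match_indices(self.needle.as_str())
            .map(|(start, m)| (start, start + m.len()))
            .collect()
    }
}

/// Splits `text` around matches, dropping the matched parts and empty pieces.
/// Depends only on the trait, so the backend can be swapped freely.
fn split_on_matches<'a>(engine: &dyn PatternEngine, text: &'a str) -> Vec<&'a str> {
    let mut pieces = Vec::new();
    let mut last = 0;
    for (start, end) in engine.find_matches(text) {
        if start > last {
            pieces.push(&text[last..start]);
        }
        last = end;
    }
    if last < text.len() {
        pieces.push(&text[last..]);
    }
    pieces
}
```

Because `split_on_matches` borrows slices from the input instead of allocating new strings, a wrapper along these lines also avoids copying the text, whichever engine sits behind the trait.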

I included unstable_wasm as an example, but maybe we could also point to your entire tokenizers-wasm directory if you want to add more features and make it more like a binding to the entire API. The example is here only to provide a starting point right now.

I took the liberty of adding both you and @josephrocca as co-authors since I clearly stole some ideas.

Feel free to comment on the PR directly if there are any questions or things that I didn't do properly, I very much just followed the wasm tutorial.

Narsil avatar Jun 06 '22 19:06 Narsil

Actually I started some work in https://github.com/huggingface/tokenizers/pull/1009 to fully integrate the work, as this feature seemed to have more traction than I expected (and it was a fun coding thing to do).

Nice

we could also point to your tokenizers-wasm entire directory if you want to add more features and make it more like a binding to the entire API.

I think this would be the best. We're planning to do that, but we also have other priorities. I think I'll continue to work on it from time to time. I'll also make the repo more contribution-friendly so that people can expose features if they need them.

mbrunel avatar Jun 07 '22 08:06 mbrunel

How about, once the PR is merged, we write something to inform the Rust/AI community that the feature now exists and why it can be useful (for security reasons in our case)?

mbrunel avatar Jun 08 '22 15:06 mbrunel

Hi @Narsil, together with @mbrunel we wrote an article explaining the motivation for porting Tokenizers to the client side, plus we added insights about how to do it. Could you have a look and give us some feedback? I was wondering if there would be some opportunity to communicate on this together to the Rust and AI communities :)

dhuynh95 avatar Jul 05 '22 05:07 dhuynh95

that's really cool @dhuynh95 @mbrunel!!

julien-c avatar Jul 08 '22 09:07 julien-c

Hey @dhuynh95 It's actually pretty cool !! Congrats.

I didn't even realize your use case was running things in an enclave. Thanks for working on re-adding other layers missing from wasm (because we can't have them :)) like from pretrained.

Narsil avatar Jul 15 '22 14:07 Narsil

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Feb 22 '24 01:02 github-actions[bot]