
Getting `tokenizers` working in Pyodide/JupyterLite

Open josephrocca opened this issue 3 years ago • 9 comments

Given that @Narsil has got wasm working in #1009, I've been looking into getting the Python bindings working in the browser with Pyodide. This would allow tokenizers to be used in JupyterLite.

Here's the general process for getting a Python package working in the browser with Pyodide (explained here):

git clone https://github.com/pyodide/pyodide && cd pyodide
./run_docker --pre-built # this mounts current directory as /src
make  # this takes several minutes, unfortunately

# install rust - only necessary temporarily until Pyodide's Docker image is updated to include Rust
sudo apt update
sudo apt install curl
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
source $HOME/.cargo/env
rustup target add wasm32-unknown-emscripten

pip install ./pyodide-build
python -m pyodide_build mkpkg tokenizers
cd packages/tokenizers
python -m pyodide_build buildpkg meta.yaml

The pyodide_build mkpkg tokenizers command generates the packages/tokenizers/meta.yaml file which is a "recipe" for building the Pyodide package.

For simple packages, mkpkg often generates a recipe that works as-is: buildpkg builds it successfully and we're done, and the Python package (which can include C++ and Rust code) now works in the browser. Often, though, some edits are needed to "patch" the code so that it compiles with Emscripten (C++) or for the wasm32-unknown-emscripten target (Rust).
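Before running buildpkg on a Rust-backed package, it can save a long failed build to confirm that the wasm32-unknown-emscripten target really is installed. A minimal sketch: the helper name `has_rust_target` is made up for illustration, and the only external command it relies on is `rustup target list --installed`.

```shell
# Hypothetical helper: reads a list of installed targets on stdin and
# succeeds only if the requested target ($1) is one of them (exact line match).
has_rust_target() {
  grep -qxF -- "$1"
}

# Only query rustup if it is on PATH, so this is harmless to paste anywhere.
if command -v rustup >/dev/null 2>&1; then
  if rustup target list --installed | has_rust_target wasm32-unknown-emscripten; then
    echo "wasm32-unknown-emscripten is installed"
  else
    echo "missing target: run 'rustup target add wasm32-unknown-emscripten'" >&2
  fi
fi
```

The check is split into a stdin-reading function so it can be exercised without rustup being present.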

In the case of tokenizers, several edits are needed. Here's the full process that I've got so far for tokenizers (I've just started clean and replicated the whole process with the commands below, so this should be a solid reproduction):

git clone https://github.com/pyodide/pyodide && cd pyodide
./run_docker --pre-built
make
sudo apt update
sudo apt install curl
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
source $HOME/.cargo/env
rustup target add wasm32-unknown-emscripten

# currently-published version of tokenizers doesn't work (I think due to old pyo3 version), so we build the latest version:
git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python
pip install --upgrade pip
pip install setuptools_rust
sudo apt-get install pkg-config libssl-dev
python setup.py install --user

And now create /src/packages/tokenizers/meta.yaml with this content (we create it manually instead of using mkpkg tokenizers):

package:
  name: tokenizers
  version: 0.12.1
source:  
  path: /src/tokenizers/bindings/python
build:
  script: |
    source $CARGO_HOME/env
    export RUST_BACKTRACE=1

    # use the latest version of tokenizers that we built above (the version numbers here are a lie)
    cd ../
    rm -rf tokenizers-0.12.1
    cp -r /src/tokenizers/bindings/python tokenizers-0.12.1
    cp -r /src/tokenizers/tokenizers tokenizers-0.12.1/tokenizers-lib
    cd tokenizers-0.12.1

    # add unstable_wasm feature to Cargo.toml
    rm Cargo.toml
    wget https://gist.githubusercontent.com/josephrocca/19e97219085d6cee13558a75327d56d5/raw/0f978e0701011b4b8ca864f28e7ba42cfd3c0da4/Cargo.toml
test:
  imports:
    - tokenizers
about:
  home: https://github.com/huggingface/tokenizers
  PyPI: https://pypi.org/project/tokenizers
  summary: Fast and Customizable Tokenizers
  license: Apache License 2.0

For example, like this (mainly spelling this out so concretely/clearly for my future self):

mkdir /src/packages/tokenizers
sudo apt-get install nano
nano /src/packages/tokenizers/meta.yaml # then paste the above yaml code, press `ctrl+x`, then `y` and Enter to save and exit

The build.script code runs before the build starts. Here I'm using it to swap out the 0.12.1 Python package for the one we built earlier, and to swap out Cargo.toml for one that has the unstable_wasm feature added. Note that I tried to build the new Python package within build.script itself, but ran into trouble (most likely due to environment variables that are set before build.script is executed), which is why I've built it separately and then pulled it into the build process in build.script. We can work out a more elegant process later (ideally we won't need to, because a new version of tokenizers will have been published to PyPI by then).
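As an alternative to downloading a replacement Cargo.toml, the same change could in principle be made in place. A rough sketch under a few assumptions: the bindings' Cargo.toml has a `[dependencies.tokenizers]` table, the core crate's feature is named `unstable_wasm` (the name that shows up in the build log), and default features should be turned off for the wasm build. `add_wasm_feature` is a hypothetical helper, not part of pyodide-build.

```shell
# Hypothetical helper: enable the core crate's `unstable_wasm` feature in the
# Cargo.toml given as $1 by adding two keys under [dependencies.tokenizers].
add_wasm_feature() {
  cargo_toml="$1"
  # Do nothing if the feature is already mentioned, so reruns are harmless.
  if grep -q 'unstable_wasm' "$cargo_toml"; then
    return 0
  fi
  tmp="$(mktemp)" || return 1
  # Copy every line through, and emit the two extra keys right after the
  # table header (key order inside a TOML table does not matter).
  awk '
    { print }
    /^\[dependencies\.tokenizers\]/ {
      print "default-features = false"
      print "features = [\"unstable_wasm\"]"
    }
  ' "$cargo_toml" > "$tmp" && cat "$tmp" > "$cargo_toml" && rm -f "$tmp"
}

# Possible usage inside build.script, in place of the rm/wget pair:
#   add_wasm_feature Cargo.toml
```

If the table header is absent the file is left unchanged, so it is worth grepping for `unstable_wasm` afterwards to confirm the edit took effect.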

Now you can run this to build the package:

cd /src/packages/tokenizers
python -m pyodide_build buildpkg meta.yaml

So here's where I'm stuck. I get lots of successful lines like this:

Logs
   Compiling pyo3-macros v0.16.2
     Running `rustc --crate-name pyo3_macros --edition=2018 /src/.docker_home/.cargo/registry/src/github.com-1ecc6299db9ec823/pyo3-macros-0.16.2/src/lib.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts,future-incompat --crate-type proc-macro --emit=dep-info,link -C prefer-dynamic -C embed-bitcode=no -C debug-assertions=off --cfg 'feature="pyproto"' -C metadata=5b07ef916bfa56f2 -C extra-filename=-5b07ef916bfa56f2 --out-dir /src/packages/tokenizers/build/tokenizers-0.12.1/target/release/deps -L dependency=/src/packages/tokenizers/build/tokenizers-0.12.1/target/release/deps --extern proc_macro2=/src/packages/tokenizers/build/tokenizers-0.12.1/target/release/deps/libproc_macro2-c5bda9837f8142a3.rlib --extern pyo3_macros_backend=/src/packages/tokenizers/build/tokenizers-0.12.1/target/release/deps/libpyo3_macros_backend-41fc306ceecb7ee7.rlib --extern quote=/src/packages/tokenizers/build/tokenizers-0.12.1/target/release/deps/libquote-13b403b6b5fcf3ce.rlib --extern syn=/src/packages/tokenizers/build/tokenizers-0.12.1/target/release/deps/libsyn-4e1a1d290716f2ae.rlib --extern proc_macro --cap-lints allow`
     Running `rustc --crate-name onig_sys /src/.docker_home/.cargo/registry/src/github.com-1ecc6299db9ec823/onig_sys-69.7.1/src/lib.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts,future-incompat --crate-type lib --emit=dep-info,metadata,link -C opt-level=3 -C embed-bitcode=no -C metadata=e90787dc11c99831 -C extra-filename=-e90787dc11c99831 --out-dir /src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps --target wasm32-unknown-emscripten -C linker=emcc -L dependency=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps -L dependency=/src/packages/tokenizers/build/tokenizers-0.12.1/target/release/deps --cap-lints allow -C relocation-model=pic -C target-feature=+mutable-globals -C link-arg=-sSIDE_MODULE=1 -C link-arg=-sWASM_BIGINT -L native=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/build/onig_sys-3f409bb3dbe1c5f7/out -l static=onig`
   Compiling darling v0.10.2
     Running `rustc --crate-name darling /src/.docker_home/.cargo/registry/src/github.com-1ecc6299db9ec823/darling-0.10.2/src/lib.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts,future-incompat --crate-type lib --emit=dep-info,metadata,link -C embed-bitcode=no -C debug-assertions=off --cfg 'feature="default"' --cfg 'feature="suggestions"' -C metadata=119243efa3c6725c -C extra-filename=-119243efa3c6725c --out-dir /src/packages/tokenizers/build/tokenizers-0.12.1/target/release/deps -L dependency=/src/packages/tokenizers/build/tokenizers-0.12.1/target/release/deps --extern darling_core=/src/packages/tokenizers/build/tokenizers-0.12.1/target/release/deps/libdarling_core-c8c51798808b0c33.rmeta --extern darling_macro=/src/packages/tokenizers/build/tokenizers-0.12.1/target/release/deps/libdarling_macro-0172bb77f0b9ad99.so --cap-lints allow`
   Compiling derive_builder_core v0.9.0
     Running `rustc --crate-name derive_builder_core /src/.docker_home/.cargo/registry/src/github.com-1ecc6299db9ec823/derive_builder_core-0.9.0/src/lib.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts,future-incompat --crate-type lib --emit=dep-info,metadata,link -C embed-bitcode=no -C debug-assertions=off -C metadata=53ddc475b59843e0 -C extra-filename=-53ddc475b59843e0 --out-dir /src/packages/tokenizers/build/tokenizers-0.12.1/target/release/deps -L dependency=/src/packages/tokenizers/build/tokenizers-0.12.1/target/release/deps --extern darling=/src/packages/tokenizers/build/tokenizers-0.12.1/target/release/deps/libdarling-119243efa3c6725c.rmeta --extern proc_macro2=/src/packages/tokenizers/build/tokenizers-0.12.1/target/release/deps/libproc_macro2-c5bda9837f8142a3.rmeta --extern quote=/src/packages/tokenizers/build/tokenizers-0.12.1/target/release/deps/libquote-13b403b6b5fcf3ce.rmeta --extern syn=/src/packages/tokenizers/build/tokenizers-0.12.1/target/release/deps/libsyn-4e1a1d290716f2ae.rmeta --cap-lints allow`
   Compiling onig v6.3.1
     Running `rustc --crate-name onig /src/.docker_home/.cargo/registry/src/github.com-1ecc6299db9ec823/onig-6.3.1/src/lib.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts,future-incompat --crate-type lib --emit=dep-info,metadata,link -C opt-level=3 -C embed-bitcode=no -C metadata=d7600d5c8c71ef9a -C extra-filename=-d7600d5c8c71ef9a --out-dir /src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps --target wasm32-unknown-emscripten -C linker=emcc -L dependency=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps -L dependency=/src/packages/tokenizers/build/tokenizers-0.12.1/target/release/deps --extern bitflags=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/libbitflags-5632235deb7d1ae0.rmeta --extern lazy_static=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/liblazy_static-9c07bdf0aed09579.rmeta --extern onig_sys=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/libonig_sys-e90787dc11c99831.rmeta --cap-lints allow -C relocation-model=pic -C target-feature=+mutable-globals -C link-arg=-sSIDE_MODULE=1 -C link-arg=-sWASM_BIGINT -L native=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/build/onig_sys-3f409bb3dbe1c5f7/out`
   Compiling thiserror v1.0.30
     Running `rustc --crate-name thiserror --edition=2018 /src/.docker_home/.cargo/registry/src/github.com-1ecc6299db9ec823/thiserror-1.0.30/src/lib.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts,future-incompat --crate-type lib --emit=dep-info,metadata,link -C opt-level=3 -C embed-bitcode=no -C metadata=b8ca63797ef8fa74 -C extra-filename=-b8ca63797ef8fa74 --out-dir /src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps --target wasm32-unknown-emscripten -C linker=emcc -L dependency=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps -L dependency=/src/packages/tokenizers/build/tokenizers-0.12.1/target/release/deps --extern thiserror_impl=/src/packages/tokenizers/build/tokenizers-0.12.1/target/release/deps/libthiserror_impl-4ceed314347ddd52.so --cap-lints allow -C relocation-model=pic -C target-feature=+mutable-globals -C link-arg=-sSIDE_MODULE=1 -C link-arg=-sWASM_BIGINT`
   Compiling derive_builder v0.9.0
     Running `rustc --crate-name derive_builder /src/.docker_home/.cargo/registry/src/github.com-1ecc6299db9ec823/derive_builder-0.9.0/src/lib.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts,future-incompat --crate-type proc-macro --emit=dep-info,link -C prefer-dynamic -C embed-bitcode=no -C debug-assertions=off -C metadata=784fade23d4deb0b -C extra-filename=-784fade23d4deb0b --out-dir /src/packages/tokenizers/build/tokenizers-0.12.1/target/release/deps -L dependency=/src/packages/tokenizers/build/tokenizers-0.12.1/target/release/deps --extern darling=/src/packages/tokenizers/build/tokenizers-0.12.1/target/release/deps/libdarling-119243efa3c6725c.rlib --extern derive_builder_core=/src/packages/tokenizers/build/tokenizers-0.12.1/target/release/deps/libderive_builder_core-53ddc475b59843e0.rlib --extern proc_macro2=/src/packages/tokenizers/build/tokenizers-0.12.1/target/release/deps/libproc_macro2-c5bda9837f8142a3.rlib --extern quote=/src/packages/tokenizers/build/tokenizers-0.12.1/target/release/deps/libquote-13b403b6b5fcf3ce.rlib --extern syn=/src/packages/tokenizers/build/tokenizers-0.12.1/target/release/deps/libsyn-4e1a1d290716f2ae.rlib --extern proc_macro --cap-lints allow`
     Running `rustc --crate-name pyo3 --edition=2018 /src/.docker_home/.cargo/registry/src/github.com-1ecc6299db9ec823/pyo3-0.16.2/src/lib.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts,future-incompat --crate-type lib --emit=dep-info,metadata,link -C opt-level=3 -C embed-bitcode=no --cfg 'feature="auto-initialize"' --cfg 'feature="default"' --cfg 'feature="extension-module"' --cfg 'feature="indoc"' --cfg 'feature="macros"' --cfg 'feature="pyo3-macros"' --cfg 'feature="pyproto"' --cfg 'feature="unindent"' -C metadata=764d919335d9522c -C extra-filename=-764d919335d9522c --out-dir /src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps --target wasm32-unknown-emscripten -C linker=emcc -L dependency=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps -L dependency=/src/packages/tokenizers/build/tokenizers-0.12.1/target/release/deps --extern cfg_if=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/libcfg_if-39adf5d2ff64c75c.rmeta --extern indoc=/src/packages/tokenizers/build/tokenizers-0.12.1/target/release/deps/libindoc-b448fe197acc8f44.so --extern libc=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/liblibc-b8b05084e31ae039.rmeta --extern parking_lot=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/libparking_lot-45e5ede053a9abcc.rmeta --extern pyo3_ffi=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/libpyo3_ffi-cba1e47a40b1143f.rmeta --extern pyo3_macros=/src/packages/tokenizers/build/tokenizers-0.12.1/target/release/deps/libpyo3_macros-5b07ef916bfa56f2.so --extern unindent=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/libunindent-55b87c7ccdc5e6be.rmeta --cap-lints allow -C relocation-model=pic -C target-feature=+mutable-globals -C 
link-arg=-sSIDE_MODULE=1 -C link-arg=-sWASM_BIGINT --cfg Py_3_6 --cfg Py_3_7 --cfg Py_3_8 --cfg Py_3_9 --cfg Py_3_10 --cfg min_const_generics --cfg addr_of`
     Running `rustc --crate-name serde /src/.docker_home/.cargo/registry/src/github.com-1ecc6299db9ec823/serde-1.0.136/src/lib.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts,future-incompat --crate-type lib --emit=dep-info,metadata,link -C opt-level=3 -C embed-bitcode=no --cfg 'feature="default"' --cfg 'feature="derive"' --cfg 'feature="rc"' --cfg 'feature="serde_derive"' --cfg 'feature="std"' -C metadata=1c74096bc385e310 -C extra-filename=-1c74096bc385e310 --out-dir /src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps --target wasm32-unknown-emscripten -C linker=emcc -L dependency=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps -L dependency=/src/packages/tokenizers/build/tokenizers-0.12.1/target/release/deps --extern serde_derive=/src/packages/tokenizers/build/tokenizers-0.12.1/target/release/deps/libserde_derive-2b3a7b6e96a81c45.so --cap-lints allow -C relocation-model=pic -C target-feature=+mutable-globals -C link-arg=-sSIDE_MODULE=1 -C link-arg=-sWASM_BIGINT --cfg no_std_atomic64`
   Compiling numpy v0.16.2
     Running `rustc --crate-name numpy --edition=2018 /src/.docker_home/.cargo/registry/src/github.com-1ecc6299db9ec823/numpy-0.16.2/src/lib.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts,future-incompat --crate-type lib --emit=dep-info,metadata,link -C opt-level=3 -C embed-bitcode=no -C metadata=12822533dda49f7e -C extra-filename=-12822533dda49f7e --out-dir /src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps --target wasm32-unknown-emscripten -C linker=emcc -L dependency=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps -L dependency=/src/packages/tokenizers/build/tokenizers-0.12.1/target/release/deps --extern libc=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/liblibc-b8b05084e31ae039.rmeta --extern ndarray=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/libndarray-93d3764271624106.rmeta --extern num_complex=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/libnum_complex-16d9ca9d7783c986.rmeta --extern num_traits=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/libnum_traits-4d31e66a498aec6a.rmeta --extern pyo3=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/libpyo3-764d919335d9522c.rmeta --cap-lints allow -C relocation-model=pic -C target-feature=+mutable-globals -C link-arg=-sSIDE_MODULE=1 -C link-arg=-sWASM_BIGINT`
   Compiling spm_precompiled v0.1.3
     Running `rustc --crate-name spm_precompiled --edition=2018 /src/.docker_home/.cargo/registry/src/github.com-1ecc6299db9ec823/spm_precompiled-0.1.3/src/lib.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts,future-incompat --crate-type lib --emit=dep-info,metadata,link -C opt-level=3 -C embed-bitcode=no -C metadata=b8aea599db749a5d -C extra-filename=-b8aea599db749a5d --out-dir /src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps --target wasm32-unknown-emscripten -C linker=emcc -L dependency=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps -L dependency=/src/packages/tokenizers/build/tokenizers-0.12.1/target/release/deps --extern base64=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/libbase64-5d9d2f2b2c7e1e37.rmeta --extern nom=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/libnom-feb94d44a9367866.rmeta --extern serde=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/libserde-1c74096bc385e310.rmeta --extern unicode_segmentation=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/libunicode_segmentation-7d97ae18053aad3a.rmeta --cap-lints allow -C relocation-model=pic -C target-feature=+mutable-globals -C link-arg=-sSIDE_MODULE=1 -C link-arg=-sWASM_BIGINT`
     Running `rustc --crate-name serde_json --edition=2018 /src/.docker_home/.cargo/registry/src/github.com-1ecc6299db9ec823/serde_json-1.0.79/src/lib.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts,future-incompat --crate-type lib --emit=dep-info,metadata,link -C opt-level=3 -C embed-bitcode=no --cfg 'feature="default"' --cfg 'feature="std"' -C metadata=8df712117eafd652 -C extra-filename=-8df712117eafd652 --out-dir /src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps --target wasm32-unknown-emscripten -C linker=emcc -L dependency=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps -L dependency=/src/packages/tokenizers/build/tokenizers-0.12.1/target/release/deps --extern itoa=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/libitoa-9d1e83f5d4d299cd.rmeta --extern ryu=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/libryu-32d5fb44db9ca404.rmeta --extern serde=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/libserde-1c74096bc385e310.rmeta --cap-lints allow -C relocation-model=pic -C target-feature=+mutable-globals -C link-arg=-sSIDE_MODULE=1 -C link-arg=-sWASM_BIGINT --cfg limb_width_32`
   Compiling tokenizers v0.12.1 (/src/packages/tokenizers/build/tokenizers-0.12.1/tokenizers-lib)
     Running `rustc --crate-name tokenizers --edition=2018 tokenizers-lib/src/lib.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts,future-incompat --crate-type lib --emit=dep-info,metadata,link -C opt-level=3 -C embed-bitcode=no --cfg 'feature="fancy-regex"' --cfg 'feature="unstable_wasm"' -C metadata=ede17f57d7babcc9 -C extra-filename=-ede17f57d7babcc9 --out-dir /src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps --target wasm32-unknown-emscripten -C linker=emcc -L dependency=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps -L dependency=/src/packages/tokenizers/build/tokenizers-0.12.1/target/release/deps --extern aho_corasick=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/libaho_corasick-d1615d17f383f3b5.rmeta --extern derive_builder=/src/packages/tokenizers/build/tokenizers-0.12.1/target/release/deps/libderive_builder-784fade23d4deb0b.so --extern dirs=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/libdirs-48221ea029a65e0e.rmeta --extern esaxx_rs=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/libesaxx_rs-ac8bfe6d7bb274c9.rmeta --extern fancy_regex=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/libfancy_regex-1606ca910fa46f44.rmeta --extern getrandom=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/libgetrandom-d77dbc101b0dff46.rmeta --extern itertools=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/libitertools-405276f5f655a29e.rmeta --extern lazy_static=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/liblazy_static-9c07bdf0aed09579.rmeta --extern 
log=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/liblog-649476918be73895.rmeta --extern macro_rules_attribute=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/libmacro_rules_attribute-e04458b783277716.rmeta --extern paste=/src/packages/tokenizers/build/tokenizers-0.12.1/target/release/deps/libpaste-839cb5693f492733.so --extern rand=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/librand-7a19530e19512c47.rmeta --extern rayon=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/librayon-ffcdc55b004a5ef6.rmeta --extern rayon_cond=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/librayon_cond-0e5b2ecc005b605d.rmeta --extern regex=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/libregex-915dfab8e08f8c98.rmeta --extern regex_syntax=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/libregex_syntax-87c4bc2bd08c4421.rmeta --extern serde=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/libserde-1c74096bc385e310.rmeta --extern serde_json=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/libserde_json-8df712117eafd652.rmeta --extern spm_precompiled=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/libspm_precompiled-b8aea599db749a5d.rmeta --extern thiserror=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/libthiserror-b8ca63797ef8fa74.rmeta --extern unicode_normalization_alignments=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/libunicode_normalization_alignments-7bda948736bec2c5.rmeta --extern 
unicode_segmentation=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/libunicode_segmentation-7d97ae18053aad3a.rmeta --extern unicode_categories=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/libunicode_categories-dd1bc265a4b68c46.rmeta -C relocation-model=pic -C target-feature=+mutable-globals -C link-arg=-sSIDE_MODULE=1 -C link-arg=-sWASM_BIGINT`

But then I get this error:

Logs
warning: `tokenizers` (lib) generated 1 warning
   Compiling tokenizers-python v0.11.0 (/src/packages/tokenizers/build/tokenizers-0.12.1)
     Running `rustc --crate-name tokenizers --edition=2018 src/lib.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts,future-incompat --crate-type cdylib --emit=dep-info,link -C opt-level=3 -C embed-bitcode=no --crate-type cdylib -C linker=emcc --cfg 'feature="default"' -C metadata=59f8f728cabf798f --out-dir /src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps --target wasm32-unknown-emscripten -C linker=emcc -L dependency=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps -L dependency=/src/packages/tokenizers/build/tokenizers-0.12.1/target/release/deps --extern env_logger=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/libenv_logger-2915b589adaafb8a.rlib --extern itertools=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/libitertools-405276f5f655a29e.rlib --extern libc=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/liblibc-b8b05084e31ae039.rlib --extern ndarray=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/libndarray-167cf6f3b00ab042.rlib --extern numpy=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/libnumpy-12822533dda49f7e.rlib --extern onig=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/libonig-d7600d5c8c71ef9a.rlib --extern pyo3=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/libpyo3-764d919335d9522c.rlib --extern rayon=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/librayon-ffcdc55b004a5ef6.rlib --extern serde=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/libserde-1c74096bc385e310.rlib --extern 
serde_json=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/libserde_json-8df712117eafd652.rlib --extern tokenizers=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/libtokenizers-ede17f57d7babcc9.rlib -C relocation-model=pic -C target-feature=+mutable-globals -C link-arg=-sSIDE_MODULE=1 -C link-arg=-sWASM_BIGINT -L native=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/build/onig_sys-3f409bb3dbe1c5f7/out`
error[E0422]: cannot find struct, variant or union type `FromPretrainedParameters` in crate `tk`
   --> src/tokenizer.rs:572:26
    |
572 |         let params = tk::FromPretrainedParameters {
    |                          ^^^^^^^^^^^^^^^^^^^^^^^^ not found in `tk`

error[E0425]: cannot find function `pthread_atfork` in crate `libc`
   --> src/lib.rs:136:19
    |
136 |             libc::pthread_atfork(None, None, Some(child_after_fork));
    |                   ^^^^^^^^^^^^^^ not found in `libc`

error[E0599]: the method `find_matches` exists for reference `&onig::Regex`, but its trait bounds were not satisfied
   --> src/utils/normalization.rs:32:61
    |
32  |                 Python::with_gil(|py| (&r.borrow(py).inner).find_matches(inside))
    |                                                             ^^^^^^^^^^^^ method cannot be called on `&onig::Regex` due to unsatisfied trait bounds
    |
   ::: /src/.docker_home/.cargo/registry/src/github.com-1ecc6299db9ec823/onig-6.3.1/src/lib.rs:147:1
    |
147 | pub struct Regex {
    | ----------------
    | |
    | doesn't satisfy `<onig::Regex as FnOnce<(char,)>>::Output = bool`
    | doesn't satisfy `onig::Regex: Fn<(char,)>`
    | doesn't satisfy `onig::Regex: tk::pattern::Pattern`
    |
    = note: the following trait bounds were not satisfied:
            `<onig::Regex as FnOnce<(char,)>>::Output = bool`
            which is required by `onig::Regex: tk::pattern::Pattern`
            `onig::Regex: Fn<(char,)>`
            which is required by `onig::Regex: tk::pattern::Pattern`
            `<&onig::Regex as FnOnce<(char,)>>::Output = bool`
            which is required by `&onig::Regex: tk::pattern::Pattern`
            `&onig::Regex: Fn<(char,)>`
            which is required by `&onig::Regex: tk::pattern::Pattern`

error[E0599]: no function or associated item named `from_pretrained` found for struct `TokenizerImpl` in the current scope
   --> src/tokenizer.rs:582:35
    |
582 |             ToPyResult(Tokenizer::from_pretrained(identifier, Some(params))).into();
    |                                   ^^^^^^^^^^^^^^^ function or associated item not found in `TokenizerImpl<PyModel, PyNormalizer, PyPreTokenizer, PyPostProcessor, PyDecoder>`

Some errors have detailed explanations: E0422, E0425, E0599.
For more information about an error, try `rustc --explain E0422`.
error: could not compile `tokenizers-python` due to 4 previous errors

Caused by:
  process didn't exit successfully: `rustc --crate-name tokenizers --edition=2018 src/lib.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts,future-incompat --crate-type cdylib --emit=dep-info,link -C opt-level=3 -C embed-bitcode=no --crate-type cdylib -C linker=emcc --cfg 'feature="default"' -C metadata=59f8f728cabf798f --out-dir /src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps --target wasm32-unknown-emscripten -C linker=emcc -L dependency=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps -L dependency=/src/packages/tokenizers/build/tokenizers-0.12.1/target/release/deps --extern env_logger=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/libenv_logger-2915b589adaafb8a.rlib --extern itertools=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/libitertools-405276f5f655a29e.rlib --extern libc=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/liblibc-b8b05084e31ae039.rlib --extern ndarray=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/libndarray-167cf6f3b00ab042.rlib --extern numpy=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/libnumpy-12822533dda49f7e.rlib --extern onig=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/libonig-d7600d5c8c71ef9a.rlib --extern pyo3=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/libpyo3-764d919335d9522c.rlib --extern rayon=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/librayon-ffcdc55b004a5ef6.rlib --extern serde=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/libserde-1c74096bc385e310.rlib --extern 
serde_json=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/libserde_json-8df712117eafd652.rlib --extern tokenizers=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/deps/libtokenizers-ede17f57d7babcc9.rlib -C relocation-model=pic -C target-feature=+mutable-globals -C link-arg=-sSIDE_MODULE=1 -C link-arg=-sWASM_BIGINT -L native=/src/packages/tokenizers/build/tokenizers-0.12.1/target/wasm32-unknown-emscripten/release/build/onig_sys-3f409bb3dbe1c5f7/out` (exit status: 1)
error: cargo failed with code: 101

Since Narsil has done the work to get this compiling to Wasm already, I'm hoping this is just a simple thing that I'm missing in the Cargo.toml, or something like that. Any ideas?

josephrocca avatar Jun 14 '22 07:06 josephrocca

It also requires patching the Python bindings:

diff --git a/bindings/python/Cargo.toml b/bindings/python/Cargo.toml
index cebc70e..817b928 100644
--- a/bindings/python/Cargo.toml
+++ b/bindings/python/Cargo.toml
@@ -23,6 +23,8 @@ itertools = "0.9"
 [dependencies.tokenizers]
 version = "*"
 path = "../../tokenizers"
+default-features = false
+features = ["unstable_wasm"]

 [dev-dependencies]
 tempfile = "3.1"
diff --git a/bindings/python/src/lib.rs b/bindings/python/src/lib.rs
index 42dd6b7..ba7a6d3 100644
--- a/bindings/python/src/lib.rs
+++ b/bindings/python/src/lib.rs
@@ -130,7 +130,7 @@ fn tokenizers(_py: Python, m: &PyModule) -> PyResult<()> {
     let _ = env_logger::try_init_from_env("TOKENIZERS_LOG");

     // Register the fork callback
-    #[cfg(target_family = "unix")]
+    #[cfg(all(target_family = "unix", not(target_family = "wasm")))]
     unsafe {
         if !REGISTERED_FORK_CALLBACK {
             libc::pthread_atfork(None, None, Some(child_after_fork));
diff --git a/bindings/python/src/tokenizer.rs b/bindings/python/src/tokenizer.rs
index e35f96c..fa0d7d0 100644
--- a/bindings/python/src/tokenizer.rs
+++ b/bindings/python/src/tokenizer.rs
@@ -561,6 +561,7 @@ impl PyTokenizer {
     ///
     /// Returns:
     ///     :class:`~tokenizers.Tokenizer`: The new tokenizer
+    #[cfg(not(target_family = "wasm"))]
     #[staticmethod]
     #[args(revision = "String::from(\"main\")", auth_token = "None")]
     #[pyo3(text_signature = "(identifier, revision=\"main\", auth_token=None)")]
diff --git a/bindings/python/src/utils/regex.rs b/bindings/python/src/utils/regex.rs
index 9e0d424..a5b2d1a 100644
--- a/bindings/python/src/utils/regex.rs
+++ b/bindings/python/src/utils/regex.rs
@@ -1,7 +1,8 @@
-use onig::Regex;
 use pyo3::exceptions;
 use pyo3::prelude::*;

+use tk::utils::SysRegex as Regex;
+
 /// Instantiate a new Regex with the given pattern
 #[pyclass(module = "tokenizers", name = "Regex")]
 #[pyo3(text_signature = "(self, pattern)")]
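
The two `cfg` changes in this patch are needed because, to my understanding, the `wasm32-unknown-emscripten` target reports both `unix` and `wasm` as its `target_family`, so a plain `#[cfg(target_family = "unix")]` would still compile the `pthread_atfork` registration there. A minimal sketch of the same gating pattern; the function names `fork_callback_enabled` and `network_features` are hypothetical and not part of tokenizers, and the wasm fallback exists only for illustration (the patch simply compiles the gated method out):

```rust
/// True when the fork-callback block from the patch would be compiled in:
/// a Unix-family target that is not also a wasm-family target.
pub fn fork_callback_enabled() -> bool {
    cfg!(all(target_family = "unix", not(target_family = "wasm")))
}

/// Compiled only on non-wasm targets, mirroring the
/// `#[cfg(not(target_family = "wasm"))]` attribute added in tokenizer.rs.
#[cfg(not(target_family = "wasm"))]
pub fn network_features() -> &'static str {
    "available"
}

/// Illustrative fallback for wasm targets (an assumption of this sketch;
/// the actual patch provides no replacement for the gated method).
#[cfg(target_family = "wasm")]
pub fn network_features() -> &'static str {
    "compiled out"
}
```

On a native Linux or macOS host, `fork_callback_enabled()` should return `true` and `network_features()` should return `"available"`; when built for a wasm-family target, both results flip.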

Then build with a maturin 0.13.0 beta release:

❯ RUSTUP_TOOLCHAIN=nightly maturin build --release -o dist --target wasm32-unknown-emscripten -i python3.10
⚠️  Warning: `build-backend` in pyproject.toml is not set to `maturin`, packaging tools such as pip will not use maturin to build this project.
🔗 Found pyo3 bindings
🐍 Found cross compiling target CPython 3.10
   Compiling pyo3-build-config v0.16.2
   Compiling pyo3-macros-backend v0.16.2
   Compiling pyo3-ffi v0.16.2
   Compiling pyo3 v0.16.2
   Compiling pyo3-macros v0.16.2
   Compiling numpy v0.16.2
   Compiling tokenizers-python v0.11.0 (/Users/messense/Projects/tokenizers/bindings/python)
warning: use of deprecated associated function `std::error::Error::description`: use the Display impl or to_string()
  --> src/utils/regex.rs:20:65
   |
20 |                 .map_err(|e| exceptions::PyException::new_err(e.description().to_owned()))?,
   |                                                                 ^^^^^^^^^^^
   |
   = note: `#[warn(deprecated)]` on by default


warning: static `REGISTERED_FORK_CALLBACK` is never used
  --> src/lib.rs:26:1
   |
26 | static mut REGISTERED_FORK_CALLBACK: bool = false;
   | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   |
   = note: `#[warn(dead_code)]` on by default


warning: function `child_after_fork` is never used
  --> src/lib.rs:27:15
   |
27 | extern "C" fn child_after_fork() {
   |               ^^^^^^^^^^^^^^^^


warning: 3 warnings emitted


    Finished release [optimized] target(s) in 23.77s
📦 Built wheel for CPython 3.10 to dist/tokenizers_python-0.11.0-cp310-cp310-emscripten_3_1_14_wasm32.whl

messense avatar Jun 30 '22 06:06 messense

@messense Thank you for looking into this! I've run into an error after carefully following your patch instructions. My branch is here:

https://github.com/josephrocca/tokenizers/tree/pyodide

(Note that I also had to remove the onig dependency from bindings/python/Cargo.toml, else I get this error when running the commands below.)

To make reproduction easy, open a GitHub Codespace for the above-linked branch and run these commands:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
source $HOME/.cargo/env
rustup target add wasm32-unknown-emscripten
rustup component add rust-src --toolchain nightly-x86_64-unknown-linux-gnu 
sudo pip install maturin==0.13.0b7
cd bindings/python
RUSTUP_TOOLCHAIN=nightly maturin build --release -o dist --target wasm32-unknown-emscripten -i python3.10

Then you get this macro_rules_attribute-related error:

Logs
@josephrocca ➜ /workspaces/tokenizers/bindings/python (pyodide ✗) $ RUSTUP_TOOLCHAIN=nightly maturin build --release -o dist --target wasm32-unknown-emscripten -i python3.10
⚠️  Warning: `build-backend` in pyproject.toml is not set to `maturin`, packaging tools such as pip will not use maturin to build this project.
🔗 Found pyo3 bindings
🐍 Found cross compiling target CPython 3.10
⚠️  Warning: The package compiler_builtins 0.1.73 (registry+https://github.com/rust-lang/crates.io-index) wasn't listed in `cargo metadata`
⚠️  Warning: The package libc 0.2.126 (registry+https://github.com/rust-lang/crates.io-index) wasn't listed in `cargo metadata`
⚠️  Warning: The package cc 1.0.69 (registry+https://github.com/rust-lang/crates.io-index) wasn't listed in `cargo metadata`
⚠️  Warning: The package memchr 2.4.1 (registry+https://github.com/rust-lang/crates.io-index) wasn't listed in `cargo metadata`
   Compiling pyo3-build-config v0.16.2
   Compiling compiler_builtins v0.1.73
   Compiling libc v0.2.126
   Compiling syn v1.0.89
  ⚠️  Warning: The package libc 0.2.126 (registry+https://github.com/rust-lang/crates.io-index) wasn't listed in `cargo metadata`
   Compiling alloc v0.0.0 (/home/codespace/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc)
   Compiling cfg-if v0.1.10
⚠️  Warning: The package cfg-if 0.1.10 (registry+https://github.com/rust-lang/crates.io-index) wasn't listed in `cargo metadata`
   Compiling memchr v2.4.1
⚠️  Warning: The package compiler_builtins 0.1.73 (registry+https://github.com/rust-lang/crates.io-index) wasn't listed in `cargo metadata`
   Compiling adler v0.2.3
⚠️  Warning: The package adler 0.2.3 (registry+https://github.com/rust-lang/crates.io-index) wasn't listed in `cargo metadata`
   Compiling rustc-demangle v0.1.21
⚠️  Warning: The package memchr 2.4.1 (registry+https://github.com/rust-lang/crates.io-index) wasn't listed in `cargo metadata`
   Compiling pyo3-ffi v0.16.2
   Compiling pyo3 v0.16.2
   Compiling unwind v0.0.0 (/home/codespace/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/unwind)
   Compiling rustc-std-workspace-alloc v1.99.0 (/home/codespace/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/rustc-std-workspace-alloc)
⚠️  Warning: The package rustc-demangle 0.1.21 (registry+https://github.com/rust-lang/crates.io-index) wasn't listed in `cargo metadata`
   Compiling panic_abort v0.0.0 (/home/codespace/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/panic_abort)
   Compiling panic_unwind v0.0.0 (/home/codespace/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/panic_unwind)
   Compiling gimli v0.25.0
   Compiling miniz_oxide v0.4.0
   Compiling hashbrown v0.12.0
⚠️  Warning: The package hashbrown 0.12.0 (registry+https://github.com/rust-lang/crates.io-index) wasn't listed in `cargo metadata`
   Compiling object v0.26.2
⚠️  Warning: The package miniz_oxide 0.4.0 (registry+https://github.com/rust-lang/crates.io-index) wasn't listed in `cargo metadata`
   Compiling std_detect v0.1.5 (/home/codespace/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/stdarch/crates/std_detect)
   Compiling darling_core v0.10.2
⚠️  Warning: The package gimli 0.25.0 (registry+https://github.com/rust-lang/crates.io-index) wasn't listed in `cargo metadata`
   Compiling pyo3-macros-backend v0.16.2
   Compiling addr2line v0.16.0
⚠️  Warning: The package object 0.26.2 (registry+https://github.com/rust-lang/crates.io-index) wasn't listed in `cargo metadata`
   Compiling serde_derive v1.0.136
⚠️  Warning: The package addr2line 0.16.0 (registry+https://github.com/rust-lang/crates.io-index) wasn't listed in `cargo metadata`
   Compiling thiserror-impl v1.0.30
   Compiling std v0.0.0 (/home/codespace/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std)
   Compiling darling_macro v0.10.2
   Compiling proc_macro v0.0.0 (/home/codespace/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/proc_macro)
   Compiling pyo3-macros v0.16.2
   Compiling darling v0.10.2
   Compiling cfg-if v1.0.0
   Compiling libc v0.2.121
   Compiling scopeguard v1.1.0
   Compiling lazy_static v1.4.0
   Compiling memchr v2.3.4
   Compiling memoffset v0.6.5
   Compiling smallvec v1.8.0
   Compiling num-traits v0.2.14
   Compiling rawpointer v0.2.1
   Compiling ryu v1.0.9
   Compiling either v1.6.1
   Compiling radium v0.5.3
   Compiling funty v1.1.0
   Compiling static_assertions v1.1.0
   Compiling tap v1.0.1
   Compiling wyz v0.2.0
   Compiling arrayvec v0.5.2
   Compiling bitflags v1.3.2
   Compiling regex-syntax v0.6.25
   Compiling ppv-lite86 v0.2.16
   Compiling bit-vec v0.6.3
   Compiling itoa v1.0.1
   Compiling base64 v0.12.3
   Compiling unicode-segmentation v1.9.0
   Compiling unindent v0.1.8
   Compiling quick-error v1.2.3
   Compiling termcolor v1.1.3
   Compiling thiserror v1.0.30
   Compiling unicode_categories v0.1.1
   Compiling macro_rules_attribute v0.0.2
error[E0432]: unresolved imports `proc_macro::macro_rules_attribute`, `proc_macro::macro_rules_derive`
 --> /home/codespace/.cargo/registry/src/github.com-1ecc6299db9ec823/macro_rules_attribute-0.0.2/src/lib.rs:8:5
  |
8 |     macro_rules_attribute,
  |     ^^^^^^^^^^^^^^^^^^^^^ no `macro_rules_attribute` in the root
9 |     macro_rules_derive,
  |     ^^^^^^^^^^^^^^^^^^ no `macro_rules_derive` in the root


error: aborting due to previous error


For more information about this error, try `rustc --explain E0432`.

error: could not compile `macro_rules_attribute` due to 2 previous errors
warning: build failed, waiting for other jobs to finish...
💥 maturin failed
  Caused by: Failed to build a native library through cargo
  Caused by: Cargo build finished with "exit status: 101": `"cargo" "rustc" "--release" "--target" "wasm32-unknown-emscripten" "--message-format" "json" "-Z" "build-std" "--lib" "--" "-C" "link-arg=-sSIDE_MODULE=2" "-C" "link-arg=-sWASM_BIGINT"`

Sorry if I've made a silly mistake here!

josephrocca avatar Jun 30 '22 11:06 josephrocca

I don't think you installed the nightly toolchain correctly. Try:

rustup toolchain install nightly
rustup target add --toolchain nightly wasm32-unknown-emscripten

If it still errors, try the git version of maturin:

cargo install --git https://github.com/PyO3/maturin.git maturin

messense avatar Jun 30 '22 12:06 messense

@messense Thanks! Per your advice, starting with a fresh Codespace of this branch: https://github.com/josephrocca/tokenizers/tree/pyodide I ran these commands:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
source $HOME/.cargo/env
rustup toolchain install nightly
rustup target add --toolchain nightly wasm32-unknown-emscripten
rustup component add rust-src --toolchain nightly-x86_64-unknown-linux-gnu # without this line, I get this error: https://gist.github.com/josephrocca/7cc0757a150ced73fb267c09407e5ebe
cargo install --git https://github.com/PyO3/maturin.git maturin
cd bindings/python
RUSTUP_TOOLCHAIN=nightly maturin build --release -o dist --target wasm32-unknown-emscripten -i python3.10

I repeated the above process twice just to be sure I didn't make any mistakes, and I got this error both times (linker emcc not found):

  • https://gist.github.com/josephrocca/77666aaaf80d5798b215c726da54c2e6

Any ideas?

Edit: Also note that if I use sudo pip install maturin==0.13.0b7 instead of cargo install --git https://github.com/PyO3/maturin.git maturin in the above process, then I get the same macro_rules_attribute error that I showed in my previous comment.

josephrocca avatar Jun 30 '22 15:06 josephrocca

You need to install Emscripten, since we are doing an out-of-tree build for Pyodide.
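
As a hint for which Emscripten version to install: the wheel name in the earlier build output carries the platform tag `emscripten_3_1_14_wasm32`, which suggests that build used Emscripten 3.1.14. A small sketch for reading that version out of a wheel filename, assuming the tag layout seen above; the helper name `emscripten_version` is hypothetical and not part of maturin or Pyodide:

```rust
/// Extracts the Emscripten version from a wheel filename whose platform tag
/// has the form `emscripten_<major>_<minor>_<patch>_wasm32`.
/// Returns `None` for wheels with any other platform tag.
pub fn emscripten_version(wheel_filename: &str) -> Option<String> {
    // The platform tag is the last dash-separated field before `.whl`.
    let stem = wheel_filename.strip_suffix(".whl")?;
    let platform_tag = stem.rsplit('-').next()?;
    // Peel off the fixed prefix and suffix, leaving e.g. `3_1_14`.
    let version = platform_tag
        .strip_prefix("emscripten_")?
        .strip_suffix("_wasm32")?;
    Some(version.replace('_', "."))
}
```

Calling it with `"tokenizers_python-0.11.0-cp310-cp310-emscripten_3_1_14_wasm32.whl"` yields `Some("3.1.14")`, while a wheel with a different platform tag yields `None`.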

messense avatar Jun 30 '22 20:06 messense

Ah, thank you! That fixed it. There's a threading issue happening at runtime that I think might be an issue with Pyodide: https://github.com/pyodide/pyodide/issues/2816

Hopefully one of the Pyodide contributors will be able to comment on that.

josephrocca avatar Jun 30 '22 23:06 josephrocca

Opened https://github.com/huggingface/tokenizers/pull/1021 to ease the process.

messense avatar Jul 01 '22 09:07 messense

I've released maturin v0.13.0-beta.8 so you should be able to use sudo pip install maturin==0.13.0b8 instead of cargo install now.

messense avatar Jul 02 '22 17:07 messense

Thanks! Just tested - works 👍

josephrocca avatar Jul 02 '22 18:07 josephrocca

Any updates on this?

Daniel-Kelvich avatar Feb 08 '23 14:02 Daniel-Kelvich

@Daniel-Kelvich

  • Original working demo: https://github.com/pyodide/pyodide/issues/2816#issuecomment-1172105687
  • Latest progress on support for more features: https://github.com/pyodide/pyodide/pull/3422

See the other threads linked in this readme for more info on various branches (e.g. non-pyodide-based wasm builds).

josephrocca avatar Feb 08 '23 15:02 josephrocca