JVM - Add bindings with Java API
Related issue: #242
This is a JVM/Java binding authored by me and @andreabrduque. We created it out of our need to integrate NLP more deeply into our (@hypefactors) data pipelines.
I expect these bindings to enable more use cases in (at least) DJL and SparkNLP, and therefore wider adoption of these great tokenizers.
This work was quite a learning experience for both of us. We are both new to Rust, its memory (safety) model and Java Native Access APIs.
Approach
We went for an MVP (minimum viable product) approach by only supporting the most common use case: loading HF Hub tokenizers directly from a JVM-based data pipeline.
As a consequence, this PR is intentionally small code-wise and restricted to that intended functionality. Nevertheless, we designed it to be easy to review, merge, maintain, and expand in the long term if use cases warrant it.
To maximize memory safety, we used @getditto's safer_ffi crate to wrap the tokenizers in an FFI-friendly interface. It drastically improved dev-friendliness and code quality compared to our first approach, which relied on memcpy and passing raw memory buffers around.
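To make the approach concrete, here is a sketch of what a safer_ffi-wrapped entry point can look like. This is illustrative only: it assumes the `safer-ffi` and `tokenizers` crates, and the `FFITokenizer` struct and function name are stand-ins, not the PR's actual exports.

```rust
use safer_ffi::prelude::*;
use tokenizers::Tokenizer;

// Opaque wrapper: the JVM side only ever sees a pointer to this.
#[derive_ReprC]
#[ReprC::opaque]
pub struct FFITokenizer {
    tokenizer: Tokenizer,
}

// Exported with a C ABI; safer_ffi checks at compile time that every type
// crossing the boundary has a well-defined C representation.
#[ffi_export]
fn tokenizer_from_pretrained(identifier: char_p::Ref<'_>) -> repr_c::Box<FFITokenizer> {
    // Hypothetical loading step; real code would propagate the error to the caller.
    let tokenizer = Tokenizer::from_pretrained(identifier.to_str(), None)
        .expect("could not load tokenizer from the HF Hub");
    Box::new(FFITokenizer { tokenizer }).into()
}
```

The opaque-struct pattern keeps the Rust side in full control of the tokenizer's memory, while the JVM side (via JNA) just holds a handle.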
Note: we noticed there was a WIP branch creating Java bindings, but it seemed to have gone stale. See https://github.com/huggingface/tokenizers/tree/java-binding
Tests
There are a handful of unit tests on the Java side to help detect regressions in the binding.
Performance
We use Java Native Access (JNA) for the Java side of the FFI. We also considered JNI and JNR. We found that JNA strikes a good balance: it has a large community behind it while still providing sufficient performance.
To validate performance, we built a small microbenchmark using JMH (Java Microbenchmark Harness), the de facto framework for this in JVM land, running on norvig's big.txt file. Our results show that a MacBook Pro 15" 2018 can tokenize around 2 MB/sec with bert-base-cased.
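As a rough sanity check of that figure (big.txt is on the order of 6.5 MB; the exact size is an assumption here, not a measurement from this PR), tokenizing the whole file at 2 MB/sec works out to roughly three seconds:

```rust
// Back-of-the-envelope timing from the reported throughput.
// Both arguments are assumptions, not measurements from this PR.
fn seconds_to_tokenize(file_bytes: f64, bytes_per_sec: f64) -> f64 {
    file_bytes / bytes_per_sec
}
```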
Future work
Check out README.md in the PR.
Thank you for reviewing!
This is really cool!
It looks like the code from this PR is currently not building properly. The output of ./gradlew compileJava
includes this warning and then this error:
Compiling tokenizers v0.11.0 (/home/kwa/Projects/other/hypefactors/tokenizers-project/tokenizers/tokenizers)
warning: fields `bos_id` and `eos_id` are never read
--> /home/kwa/Projects/other/hypefactors/tokenizers-project/tokenizers/tokenizers/src/models/unigram/lattice.rs:59:5
|
53 | pub struct Lattice<'a> {
| ------- fields in this struct
...
59 | bos_id: usize,
| ^^^^^^^^^^^^^
60 | eos_id: usize,
| ^^^^^^^^^^^^^
|
= note: `#[warn(dead_code)]` on by default
= note: `Lattice` has a derived impl for the trait `Debug`, but this is intentionally ignored during dead code analysis
warning: `tokenizers` (lib) generated 1 warning
Compiling safer-ffi-tokenizers v0.1.0 (/home/kwa/Projects/other/hypefactors/tokenizers-project/tokenizers/bindings/jvm/lib/src/main/rust)
error[E0623]: lifetime mismatch
--> src/lib.rs:173:4
|
173 | fn encode_batch(
| ^^^^^^^^^^^^ ...but data from `ffi_input` flows into `ffi_input` here
174 | it: &FFITokenizer,
175 | ffi_input: &repr_c::Vec<char_p::Ref>,
| -------------------------
| |
| these two types are declared with different lifetimes...
For more information about this error, try `rustc --explain E0623`.
error: could not compile `safer-ffi-tokenizers` due to previous error
> Task :buildRust FAILED
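For what it's worth, E0623 typically means two reference parameters were given independent (elided) lifetimes while the function body forces them to agree. A minimal illustration of the pattern and the usual fix, with names that only echo the PR's (this is not its actual code):

```rust
// Hypothetical reduction of the E0623 pattern; not the PR's real types.
struct FfiInput<'a> {
    items: Vec<&'a str>,
}

// Broken version (commented out): with two independent lifetimes, data from
// `s` flows into `out.items`, whose entries require the *other* lifetime,
// and rustc reports "lifetime mismatch" (E0623):
// fn collect<'a, 'b>(out: &mut FfiInput<'a>, s: &'b str) { out.items.push(s); }

// Usual fix: name the lifetime once and share it across the parameters
// that must agree.
fn collect<'a>(out: &mut FfiInput<'a>, s: &'a str) {
    out.items.push(s);
}
```

In the safer_ffi signature above, tying the lifetime of the `char_p::Ref` elements to the borrow of `ffi_input` in the same way should resolve the mismatch.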