JVM - Add bindings with Java API
Related issue: #242
This is a JVM/Java binding authored by me and @andreabrduque. We created it out of our need to integrate NLP more deeply into our (@hypefactors) data pipelines.
I expect these bindings to enable more use cases in (at least) DJL and SparkNLP, and therefore wider adoption of these great tokenizers.
This work was quite a learning experience for both of us. We are both new to Rust, its memory (safety) model and Java Native Access APIs.
Approach
We went for an MVP (minimum viable product) approach by only supporting the most common use case: loading HF Hub tokenizers directly from a JVM-based data pipeline.
As a consequence, this PR is intentionally small code-wise and restricted to that intended functionality. Nevertheless, we designed it to be easy to review, merge, maintain, and expand in the long term if use cases warrant it.
To maximize memory safety, we used @getditto's safer_ffi crate to wrap the tokenizers in an FFI-friendly interface. It drastically improved dev-friendliness and code quality compared to our first approach, which relied on memcpy and passing raw memory buffers around.
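To make the approach concrete, here is a sketch of what a safer_ffi-wrapped entry point can look like. This is illustrative only: it assumes the `safer-ffi` and `tokenizers` crates, and the `FFITokenizer` struct and function name are stand-ins, not the PR's actual exports.

```rust
use safer_ffi::prelude::*;
use tokenizers::Tokenizer;

// Opaque wrapper: the JVM side only ever sees a pointer to this.
#[derive_ReprC]
#[ReprC::opaque]
pub struct FFITokenizer {
    tokenizer: Tokenizer,
}

// Exported with a C ABI; safer_ffi checks at compile time that every type
// crossing the boundary has a well-defined C representation.
#[ffi_export]
fn tokenizer_from_pretrained(identifier: char_p::Ref<'_>) -> repr_c::Box<FFITokenizer> {
    // Hypothetical loading step; real code would propagate the error to the caller.
    let tokenizer = Tokenizer::from_pretrained(identifier.to_str(), None)
        .expect("could not load tokenizer from the HF Hub");
    Box::new(FFITokenizer { tokenizer }).into()
}
```

The opaque-struct pattern keeps the Rust side in full control of the tokenizer's memory, while the JVM side (via JNA) just holds a handle.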
Note: we noticed there was a WIP branch creating Java bindings, but it seemed to have gone stale. See https://github.com/huggingface/tokenizers/tree/java-binding
Tests
There are a handful of unit tests on the Java side to help detect regressions in the binding.
Performance
We use Java Native Access (JNA) for the Java side of the FFI. We also considered JNI and JNR. We found that JNA strikes a good balance: it has a large community behind it while still providing sufficient performance.
To validate performance, we built a small microbenchmark using JMH (Java Microbenchmark Harness), the de facto framework for this in JVM land, running on norvig's big.txt file. Our results show that a MacBook Pro 15" 2018 can tokenize around 2 MB/sec with bert-base-cased.
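As a rough sanity check of that figure (big.txt is on the order of 6.5 MB; the exact size is an assumption here, not a measurement from this PR), tokenizing the whole file at 2 MB/sec works out to roughly three seconds:

```rust
// Back-of-the-envelope timing from the reported throughput.
// Both arguments are assumptions, not measurements from this PR.
fn seconds_to_tokenize(file_bytes: f64, bytes_per_sec: f64) -> f64 {
    file_bytes / bytes_per_sec
}
```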
Future work
Check out README.md in the PR.
Thank you for reviewing!
This is really cool!
It looks like the code from this PR is currently not building properly. The output of ./gradlew compileJava
includes this warning and then this error:
Compiling tokenizers v0.11.0 (/home/kwa/Projects/other/hypefactors/tokenizers-project/tokenizers/tokenizers)
warning: fields `bos_id` and `eos_id` are never read
--> /home/kwa/Projects/other/hypefactors/tokenizers-project/tokenizers/tokenizers/src/models/unigram/lattice.rs:59:5
|
53 | pub struct Lattice<'a> {
| ------- fields in this struct
...
59 | bos_id: usize,
| ^^^^^^^^^^^^^
60 | eos_id: usize,
| ^^^^^^^^^^^^^
|
= note: `#[warn(dead_code)]` on by default
= note: `Lattice` has a derived impl for the trait `Debug`, but this is intentionally ignored during dead code analysis
warning: `tokenizers` (lib) generated 1 warning
Compiling safer-ffi-tokenizers v0.1.0 (/home/kwa/Projects/other/hypefactors/tokenizers-project/tokenizers/bindings/jvm/lib/src/main/rust)
error[E0623]: lifetime mismatch
--> src/lib.rs:173:4
|
173 | fn encode_batch(
| ^^^^^^^^^^^^ ...but data from `ffi_input` flows into `ffi_input` here
174 | it: &FFITokenizer,
175 | ffi_input: &repr_c::Vec<char_p::Ref>,
| -------------------------
| |
| these two types are declared with different lifetimes...
For more information about this error, try `rustc --explain E0623`.
error: could not compile `safer-ffi-tokenizers` due to previous error
> Task :buildRust FAILED
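For what it's worth, E0623 typically means two reference parameters were given independent (elided) lifetimes while the function body forces them to agree. A minimal illustration of the pattern and the usual fix, with names that only echo the PR's (this is not its actual code):

```rust
// Hypothetical reduction of the E0623 pattern; not the PR's real types.
struct FfiInput<'a> {
    items: Vec<&'a str>,
}

// Broken version (commented out): with two independent lifetimes, data from
// `s` flows into `out.items`, whose entries require the *other* lifetime,
// and rustc reports "lifetime mismatch" (E0623):
// fn collect<'a, 'b>(out: &mut FfiInput<'a>, s: &'b str) { out.items.push(s); }

// Usual fix: name the lifetime once and share it across the parameters
// that must agree.
fn collect<'a>(out: &mut FfiInput<'a>, s: &'a str) {
    out.items.push(s);
}
```

In the safer_ffi signature above, tying the lifetime of the `char_p::Ref` elements to the borrow of `ffi_input` in the same way should resolve the mismatch.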