C and C++ bindings to Tokenizers
Adding bindings for two more languages:
- `bindings/cpp`
- `bindings/c`
The C layer is the intermediate step that binds C++ to Rust, i.e. C++ <--> C <--> Rust.
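The C++ <--> C <--> Rust layering can be sketched roughly as below. This is a self-contained illustration, not the PR's actual code: all names (`tokenizer_handle`, `tokenizer_new`, ...) are hypothetical, and a stub backend stands in for the Rust side, which in the real bindings would export these symbols over FFI.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// --- C layer (bindings/c): a flat ABI that Rust can export over FFI. ---
// All names here are hypothetical; in the real bindings the implementations
// would live in Rust, compiled as a staticlib and declared in a C header.
extern "C" {
typedef struct tokenizer_handle tokenizer_handle;  // opaque to callers

tokenizer_handle* tokenizer_new(const char* json_blob);
size_t tokenizer_encode(tokenizer_handle* h, const char* text,
                        uint32_t* ids_out, size_t capacity);
void tokenizer_free(tokenizer_handle* h);
}

// Stub backend so this sketch compiles stand-alone: a fake "whitespace
// tokenizer" that emits one sequential id per word, just to exercise the ABI.
struct tokenizer_handle {
  std::string config;
};

extern "C" tokenizer_handle* tokenizer_new(const char* json_blob) {
  return new tokenizer_handle{json_blob};
}

extern "C" size_t tokenizer_encode(tokenizer_handle*, const char* text,
                                   uint32_t* ids_out, size_t capacity) {
  size_t n = 0;
  bool in_word = false;
  for (const char* p = text; *p != '\0'; ++p) {
    if (*p != ' ' && !in_word) {
      if (n < capacity) ids_out[n] = static_cast<uint32_t>(n);
      ++n;
      in_word = true;
    } else if (*p == ' ') {
      in_word = false;
    }
  }
  return n;  // number of ids produced (may exceed capacity)
}

extern "C" void tokenizer_free(tokenizer_handle* h) { delete h; }

// --- C++ layer (bindings/cpp): a thin RAII wrapper over the C ABI. ---
class Tokenizer {
 public:
  explicit Tokenizer(const std::string& json)
      : handle_(tokenizer_new(json.c_str())) {}
  ~Tokenizer() { tokenizer_free(handle_); }
  Tokenizer(const Tokenizer&) = delete;
  Tokenizer& operator=(const Tokenizer&) = delete;

  std::vector<uint32_t> Encode(const std::string& text) const {
    std::vector<uint32_t> ids(256);
    size_t n = tokenizer_encode(handle_, text.c_str(), ids.data(), ids.size());
    ids.resize(n < ids.size() ? n : ids.size());
    return ids;
  }

 private:
  tokenizer_handle* handle_;
};
```

The point of the middle layer is that both sides only agree on a C ABI: Rust can export it with `#[no_mangle] extern "C"` functions, and any language with a C FFI (C++ here, but also Go, Python via ctypes, etc.) can consume the same header.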
--
- Added tests for the C++ bindings
- Added benchmarks as a sanity check; the results are as expected.
Hi @ArthurZucker, I've made some progress with C++, adding more APIs. But templating with Jinja2 is giving me a bit of trouble. I'm wondering how your team is handling templates, e.g. chat_template -- is there native support in the Rust code for chat_template in Jinja2 format?
I tried integrating https://github.com/jinja2cpp/Jinja2Cpp/ at the C++ bindings layer, but some features are limited and lead to crashes (IIRC, negative indices like messages[-1]).
Any tips/recommendations for handling templates natively would be great.
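For reference, the kind of template in question looks like the following: a minimal, purely illustrative chat_template (hypothetical, not taken from any particular model), including the negative-index construct mentioned above.

```jinja
{#- Illustrative chat_template; hypothetical, not from a real model. -#}
{%- for message in messages -%}
<|{{ message['role'] }}|>{{ message['content'] }}<|end|>
{%- endfor -%}
{#- Negative indexing like messages[-1] is valid Jinja but is exactly the
    kind of feature that partial reimplementations tend to miss. -#}
{%- if add_generation_prompt and messages[-1]['role'] != 'assistant' -%}
<|assistant|>
{%- endif -%}
```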
Upon digging into the https://github.com/huggingface/text-generation-inference codebase, I found that it uses minijinja in Rust to render chat templates (with some workarounds for unsupported features). So I've removed the Jinja rendering from the C++ bindings layer and instead moved it into the tokenizer core in Rust, using minijinja. This way, all bindings can access the functionality.
Disclaimer: I'm not proficient in Rust, and most of the code was written by AI agents (though I've tried to supervise them closely). Based on my testing, everything seems to work (at least for my use case, and its tests pass).
I'll drop this here as more of an FYI: check https://github.com/mlc-ai/tokenizers-cpp
There are existing C++ bindings for Tokenizers that go through the same Rust -> C -> C++ path. The C++ code also binds more than tokenizers, because it includes sentencepiece, but that part can be cut out.
I think if you want more battle-tested code, fork tokenizers-cpp/rust into your current bindings/c folder and ship tokenizers_c.h.
Once that is done, fork tokenizers_cpp.h and huggingface_tokenizer.cc into bindings/cpp.
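If it helps, the usage pattern tokenizers-cpp documents is to read tokenizer.json into memory and construct the tokenizer from the blob. Below is a hedged sketch of the file-loading half, which is the only part that compiles without the library linked in; the `FromBlobJSON` call is shown in a comment and assumes tokenizers-cpp's API.

```cpp
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

// Read a whole file (e.g. tokenizer.json) into a string, so it can be
// handed to an in-memory constructor such as tokenizers-cpp's
// tokenizers::Tokenizer::FromBlobJSON(blob).
std::string LoadBytesFromFile(const std::string& path) {
  std::ifstream fs(path, std::ios::in | std::ios::binary);
  if (!fs) {
    throw std::runtime_error("cannot open " + path);
  }
  std::ostringstream buffer;
  buffer << fs.rdbuf();
  return buffer.str();
}

// With tokenizers-cpp linked in (not done here), usage would look like:
//   auto tok = tokenizers::Tokenizer::FromBlobJSON(
//       LoadBytesFromFile("tokenizer.json"));
//   std::vector<int32_t> ids = tok->Encode("Hello world");
```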
The LICENSE shouldn't be an issue. And I think https://chat.webllm.ai/ has a live deployment of the tokenizer with WASM.