
C and C++ bindings to Tokenizers

Open thammegowda opened this issue 1 month ago • 3 comments

Adding in bindings for two more languages!

  • bindings/cpp
  • bindings/c

C is an intermediate step to bind C++ and Rust: i.e., C++ <--> C <--> Rust.
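To illustrate the C++ <--> C <--> Rust layering, here is a minimal sketch, not the code from this PR. Every name in it (`demo_tokenizer`, `demo_tokenizer_new`, `demo_tokenizer_encode`, `demo::Tokenizer`) is hypothetical. In a real build the C functions would be implemented in Rust and exported with a C ABI; here they are stubbed in C++ so the file compiles on its own, and the stub "tokenizer" just emits one id per input byte.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

// ---- C layer (hypothetical names) ----------------------------------------
// Callers only ever see an opaque pointer, so the Rust side is free to change
// its internal layout without breaking the C or C++ bindings.
extern "C" {
struct demo_tokenizer;

demo_tokenizer* demo_tokenizer_new(void);
void demo_tokenizer_free(demo_tokenizer* t);
// Writes at most `cap` ids into `out` and returns how many ids the input needs.
size_t demo_tokenizer_encode(const demo_tokenizer* t, const char* text,
                             size_t len, uint32_t* out, size_t cap);
}

// ---- Stub standing in for the Rust implementation (assumption) ------------
struct demo_tokenizer { int unused; };

demo_tokenizer* demo_tokenizer_new(void) { return new demo_tokenizer{0}; }
void demo_tokenizer_free(demo_tokenizer* t) { delete t; }
size_t demo_tokenizer_encode(const demo_tokenizer*, const char* text,
                             size_t len, uint32_t* out, size_t cap) {
  for (size_t i = 0; i < len && i < cap && out != nullptr; ++i) {
    out[i] = static_cast<unsigned char>(text[i]);  // one fake id per byte
  }
  return len;
}

// ---- C++ layer: RAII wrapper over the C handle ----------------------------
namespace demo {
class Tokenizer {
 public:
  Tokenizer() : handle_(demo_tokenizer_new()) {
    if (!handle_) throw std::runtime_error("demo_tokenizer_new failed");
  }

  std::vector<uint32_t> encode(const std::string& text) const {
    // First call asks for the required size, second call fills the buffer.
    const size_t n = demo_tokenizer_encode(handle_.get(), text.data(),
                                           text.size(), nullptr, 0);
    std::vector<uint32_t> ids(n);
    const size_t written = demo_tokenizer_encode(
        handle_.get(), text.data(), text.size(), ids.data(), ids.size());
    assert(written == n);
    (void)written;  // silence unused-variable warnings when NDEBUG is set
    return ids;
  }

 private:
  struct Deleter {
    void operator()(demo_tokenizer* t) const { demo_tokenizer_free(t); }
  };
  std::unique_ptr<demo_tokenizer, Deleter> handle_;  // freed via the C API
};
}  // namespace demo
```

The point of the middle layer is that C++ code never touches Rust types directly: ownership is expressed through a `new`/`free` pair in C, and the C++ wrapper ties that pair to object lifetime.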

--

  • Added tests for C++
  • Added benchmarks as a sanity check; the results are as expected.

thammegowda avatar Nov 21 '25 09:11 thammegowda

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Hi @ArthurZucker, I've made some progress on the C++ bindings and added more APIs. But templating with Jinja2 is giving me a bit of trouble. How is your team handling templates, e.g. chat_template? Is there native support in the Rust code for chat_template in Jinja2 format?

I tried integrating https://github.com/jinja2cpp/Jinja2Cpp/ at the C++ bindings layer, but some features are limited and lead to crashes (IIRC, negative indexing like messages[-1]).
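For readers unfamiliar with the feature: in Jinja2 (as in Python), `messages[-1]` means "the last message". The sketch below only illustrates that semantics in C++; it is not taken from Jinja2Cpp and does not explain why that library crashed. The helper name `at_jinja_index` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Resolves a Python/Jinja-style index, where -1 is the last element,
// -2 the one before it, and so on. Returns nullptr when the index is out of
// range instead of reading past the ends of the vector.
template <typename T>
const T* at_jinja_index(const std::vector<T>& items, long long index) {
  const long long size = static_cast<long long>(items.size());
  const long long pos = index < 0 ? index + size : index;
  if (pos < 0 || pos >= size) return nullptr;
  return &items[static_cast<std::size_t>(pos)];
}
```

A template engine that forwards a negative index straight to an unchecked container access would read out of bounds, which is one plausible way such a crash could arise.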

Any tips or recommendations for handling templates natively would be great.

thammegowda avatar Dec 01 '25 19:12 thammegowda

Upon digging into the https://github.com/huggingface/text-generation-inference codebase, I found it uses minijinja in Rust to render chat templates (with some workarounds for unsupported features). So I've removed Jinja rendering from the C++ bindings layer and moved it into the tokenizer core in Rust, using minijinja. This way, all bindings can access the functionality.

Disclaimer: I'm not proficient in Rust, and most of the code was written by AI agents (though I've tried to supervise them closely). Based on my testing, everything seems to work (at least for my use case, and its tests pass).

thammegowda avatar Dec 07 '25 22:12 thammegowda

I'll drop this more as an FYI: check https://github.com/mlc-ai/tokenizers-cpp

There are existing C++ bindings for Tokenizers that go through the same Rust -> C -> C++ path. That C++ code also binds more than tokenizers because it includes sentencepiece, but that part can be cut out.

I think if you want more battle-tested code, fork tokenizers-cpp/rust into your current bindings/c folder and ship tokenizers_c.h.

Once that is done, fork tokenizers_cpp.h and huggingface_tokenizer.cc into bindings/cpp.

The LICENSE shouldn't be an issue. And I think https://chat.webllm.ai/ has a live deployment of the tokenizer with WASM.

IvanIsCoding avatar Dec 14 '25 17:12 IvanIsCoding