C++ bindings
#185
Still needs to be done:
- [x] Missing components (normalizers/pre-tokenizers/etc.)
- [x] Tests (with good coverage)
- [ ] Documentation
- [ ] Example of depending on this from an external project
- [ ] Add to CI
These can maybe go in separate, later PRs:
- [ ] Training
- [ ] Serialization/deserialization
Writing custom components in C++ will definitely come later.
Normalizers, pre-tokenizers, and models are done. I will try to finish the remaining parts on Friday, but I don't expect any more significant API changes. So if anybody wants to review what is there, now is a good time.
Thank you for this PR @alexeyr! I'll do my best to have a look in the near future!
@n1t0 I'm chasing down a weird bug with Tokenizer::add_tokens/add_special_tokens under Windows/MSVC. If I don't figure it out tomorrow, I'll probably have to leave it to a future PR. Otherwise, this seems close to done.
Huggingface really lacks C++ bindings. I'm working on both Python and C++ code, and both sides must produce the same results. There are some Chinese tokenizers on GitHub, but they do not work properly. I see this pull request is stuck for some reason. I tried this code; it compiled and worked perfectly on Windows. Over the next two weeks or so I'm going to try it on Linux (it at least compiles there). I'm looking forward to this pull request being accepted. By the way, thanks to all the developers of Huggingface, and especially to alexeyr for the C++ bindings.
Hey @alexeyr 👋🏻, thanks so much for bringing this C++ wrapper.
I'm especially interested in this for some projects we have at huggingface and wanted to leverage what you did here 💪🏻.
Unfortunately, I'm not sure I understand the output structure generated by cargo build. Can you suggest the correct way to set up a C++ project (CMake-based, but I'll adapt to any format) so that all symbols are resolved?
Currently the compiler complains about rust:: references not being found on my side.
Many thanks for your help & amazing work here.
Morgan