tokenizers

C++ bindings

Open alexeyr opened this issue 5 years ago • 5 comments

#185

Still needs to be done:

  • [x] Missing components (normalizers/pre-tokenizers/etc.)
  • [x] Tests (with good coverage)
  • [ ] Documentation
  • [ ] Example of depending on this from an external project
  • [ ] Add to CI

Could possibly be left to separate, later PRs:

  • [ ] Training
  • [ ] Serialization/deserialization

Writing custom components in C++ will definitely come later.

alexeyr avatar Dec 09 '20 14:12 alexeyr

Normalizers, pre-tokenizers, and models are done. I will try to finish the remaining parts on Friday, but don't expect any more significant API changes. So if anybody wants to review what is there, now is a good time.

alexeyr avatar Dec 23 '20 16:12 alexeyr

Thank you for this PR @alexeyr! I'll do my best to have a look in the near future!

n1t0 avatar Jan 07 '21 14:01 n1t0

@n1t0 I'm chasing down a weird bug with Tokenizer::add_tokens/add_special_tokens under Windows/MSVC. If I don't figure it out tomorrow, I'll probably have to leave it to a future PR. Otherwise, this seems close to done.

alexeyr avatar Jan 18 '21 18:01 alexeyr

Huggingface really lacks C++ bindings. I'm working on both Python and C++ code, and both sides must produce the same results. GitHub has some Chinese tokenizers, but they do not work properly. I see this pull request is stuck for some reason. I tried this code; it compiled and worked perfectly on Windows. Over the next ~two weeks I'm going to try it on Linux (at least it compiled on Linux). I'm looking forward to this pull request being accepted. By the way, thanks to all the developers of Huggingface, especially alexeyr for the C++ bindings.

ffx0yandex-ru avatar Apr 07 '21 08:04 ffx0yandex-ru

Hey @alexeyr 👋🏻, thanks so much for bringing this C++ wrapper.

I'm especially interested in this for some projects we have at huggingface and wanted to leverage what you did here 💪🏻.

Unfortunately, I'm not sure I understand the output structure generated by cargo build. Can you suggest the correct way to set up a C++ project (CMake-based, but I'll adapt to any format) in order to get all symbols resolved?

Currently the compiler complains about rust:: references not being found on my side. Many thanks for your help & amazing work here.

Morgan

mfuntowicz avatar Apr 25 '21 11:04 mfuntowicz