
Support for Golang?

Open atahmasb opened this issue 5 years ago • 8 comments

Hey team, thanks for this great library. It let us use the tokenizer without installing the whole transformers library! Are there any plans for Golang bindings over the Rust implementation, or for an implementation from scratch?

Also, where would one start to write it from scratch in Golang or any other language?

atahmasb avatar Sep 14 '20 07:09 atahmasb

Hi !

No plans for now to support Golang, but we might add support for a CLI, which would make it usable from Golang, I guess.

If you want to write one from scratch you can look at the source code, however you might want to double check what you actually need. The GC in Golang might make it not very suitable for your project if you need latency stability.
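For illustration, the CLI route could look roughly like the sketch below. It assumes a hypothetical command (`my-tokenizer-cli` and its `--ids-only` flag are made up for this example, no such CLI exists in the library today) that reads text on stdin and prints whitespace-separated token IDs on stdout. Only the Go standard library is used.

```go
package main

import (
	"bytes"
	"fmt"
	"os/exec"
	"strconv"
	"strings"
)

// encodeViaCLI runs an external tokenizer command, writes text to its stdin,
// and parses whitespace-separated integer token IDs from its stdout.
// The stdin/stdout format is an assumption made for this sketch.
func encodeViaCLI(text, name string, args ...string) ([]int, error) {
	cmd := exec.Command(name, args...)
	cmd.Stdin = strings.NewReader(text)

	var stdout, stderr bytes.Buffer
	cmd.Stdout = &stdout
	cmd.Stderr = &stderr

	if err := cmd.Run(); err != nil {
		return nil, fmt.Errorf("running %s: %w (stderr: %s)",
			name, err, strings.TrimSpace(stderr.String()))
	}

	// Split the output on any whitespace and convert each field to an int.
	fields := strings.Fields(stdout.String())
	ids := make([]int, 0, len(fields))
	for _, f := range fields {
		id, err := strconv.Atoi(f)
		if err != nil {
			return nil, fmt.Errorf("parsing token id %q: %w", f, err)
		}
		ids = append(ids, id)
	}
	return ids, nil
}

func main() {
	// Hypothetical command name and flag; replace with whatever CLI is available.
	ids, err := encodeViaCLI("Hello, world!", "my-tokenizer-cli", "--ids-only")
	if err != nil {
		fmt.Println("encode failed:", err)
		return
	}
	fmt.Println(ids)
}
```

Note that each call spawns a new process, so this approach pays process startup cost on every request.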

Narsil avatar Sep 14 '20 08:09 Narsil

@Narsil Is there any consideration for a C API? It would make it much easier for the community to build bindings for other languages.

IIRC compiling Rust to a C library is pretty straightforward. With that said, I'm relatively new to Rust, so I could definitely be wrong.

pbatk avatar Sep 14 '20 15:09 pbatk

We are not focusing on adding new language bindings at the moment, but on stabilizing the current API.

Once that's done, adding new languages would definitely be appreciated. If you're willing to start a C binding, it would be very welcome, but keep in mind we're still making quite big API changes as we're adding more functionality to the lib right now.

Narsil avatar Sep 14 '20 16:09 Narsil

@Narsil I'm not in immediate need of the C bindings yet (I really need Go, but I think C as the intermediary would be best for the community). I will need them in 4-8 months. If the C bindings are still needed at that time, I'll open a separate issue and PR.

Thanks!

pbatk avatar Sep 14 '20 16:09 pbatk

@Narsil I don't see a major issue with GC in terms of latency stability. I have used the CLI approach, but it's not nearly as useful in a production environment as a Go pkg would be. I was wondering whether the team is open to contributions for a native Go pkg, or whether it would be better to wait until the main library is more stable!

atahmasb avatar Sep 14 '20 16:09 atahmasb

@atahmasb We're open to all contributions, but yes, the API is not yet stable; we're aiming for it. We'll try to open PRs here so the design decisions are at least very clear and public. See #409 for instance. Don't hesitate to comment if you think our direction would not suit your use case.

Narsil avatar Sep 14 '20 16:09 Narsil

omg, why is Go so underrated in the machine learning area?

batara666 avatar Jun 22 '21 16:06 batara666

@pbatk @atahmasb @batara666 https://github.com/sugarme/tokenizer is a Go package with an initial implementation of HF tokenizers. It includes Bert, GPT2, and Roberta.

JoeREISys avatar May 27 '22 04:05 JoeREISys

I've created Go bindings for the library. They currently expose a very small API surface, mainly to satisfy my own needs, but additional contributions are welcome.

daulet avatar Apr 19 '23 18:04 daulet

I have tried https://github.com/daulet/tokenizers and it works great! Thanks @daulet! I prefer it over https://github.com/sugarme/tokenizer because it uses bindings to the Rust implementation, which guarantees compliance with HF tokenization, rather than implementing tokenization from scratch. It also ensures that performance is great by leveraging Rust's efficiency.

I've made some contributions to it that are awaiting review. With those two options, I think we can consider this thread solved.

clems4ever avatar Jul 24 '23 09:07 clems4ever