Transformers.jl
About choice of Tokenizers
@chengchingwen What do we do about the Hugging Face BERT tokenizer? Is that in the plan?
It's on the roadmap, but won't be covered by the GSoC project. I'm considering wrapping the Rust implementation in huggingface/tokenizers with BinaryBuilder.jl.
This project needs more love from the community. Looks like progress is stalled? @chengchingwen
Yes, I'd love to see more people contributing to this project. Currently I'm quite busy (working and studying), so I can't spend too much effort on it. I do have some unreleased code snippets for the tokenizer, but most of them are not well tested. I'll try to squeeze out some time to release them (probably in 1-2 weeks).
Some updates/thoughts about tokenizers:
- The Rust implementation doesn't have a stable API, nor a public C API for making bindings. We would need to either come up with a C API or find some way to call Rust functions directly from Julia.
- The easiest way would actually be to use PyCall.jl to load huggingface/transformers or huggingface/tokenizers and use those tokenizers from Python, but I personally dislike this approach. I avoid Python like the plague (that's why Pickle.jl exists).
- Ideally I would love to see a native Julia implementation, but then lots of stuff would need to be reimplemented. Currently we only have some basic tokenizers that I modified from the original Python implementation (the `WordPiece` in `src/bert`, the `Bpe` from BytePairEncoding.jl) and many other utilities from WordTokenizers.jl.
- Bindings are a desirable approach, but the problem is that most tokenizers are implemented in a binding-unfriendly way, e.g. Cython/Python or C++/Rust without a C API.
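To make the binding route concrete: Julia's `ccall` can hook any C-ABI symbol directly. In the sketch below, libc's `strlen` stands in for a hypothetical C wrapper around the Rust tokenizer; the `tok_encode` symbol and shim library mentioned in the comments are imagined, not real exports of huggingface/tokenizers.

```julia
# Demonstration of the C-binding mechanism using libc's strlen.
# A real binding would look similar, e.g. if a Rust shim crate (built with
# BinaryBuilder.jl) exported a hypothetical symbol:
#     #[no_mangle] pub extern "C" fn tok_encode(text: *const c_char) -> *const c_char
# the Julia side would be roughly:
#     encode(text) = unsafe_string(ccall((:tok_encode, :libtokenizers_shim),
#                                        Cstring, (Cstring,), text))
c_strlen(s::String) = ccall(:strlen, Csize_t, (Cstring,), s)

c_strlen("tokenizer")  # → 9 (as a Csize_t)
```

The appeal of this approach is that it needs no Python runtime; the obstacle, as noted above, is that such a stable C API does not currently exist upstream.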
I implemented the word-piece tokenizer in native Julia, as a package named BertWordPieceTokenizer, and I've registered it on JuliaHub.
Currently it can load word-piece tokenizers, e.g. BERT and RoBERTa, but it fails for SentencePiece tokenizers such as ALBERT.
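For readers unfamiliar with the algorithm: the core of a word-piece tokenizer is a greedy longest-match-first loop over a fixed vocabulary. The following is a minimal self-contained sketch (the `##` continuation prefix and `[UNK]` fallback follow BERT's scheme; this is an illustration, not BertWordPieceTokenizer's actual API):

```julia
# Greedy longest-match-first WordPiece tokenization of a single word.
# `vocab` is a plain Set of pieces; continuation pieces carry a "##" prefix.
function wordpiece(word::String, vocab::Set{String}; unk::String="[UNK]")
    tokens = String[]
    start = 1
    while start <= lastindex(word)
        stop = lastindex(word)
        piece = nothing
        # Try the longest remaining substring first, shrinking until a vocab hit.
        while stop >= start
            cand = (start > 1 ? "##" : "") * word[start:stop]
            if cand in vocab
                piece = cand
                break
            end
            stop = prevind(word, stop)
        end
        # No piece matches: the whole word maps to the unknown token.
        piece === nothing && return [unk]
        push!(tokens, piece)
        start = nextind(word, stop)
    end
    return tokens
end

vocab = Set(["un", "##aff", "##able"])
wordpiece("unaffable", vocab)  # → ["un", "##aff", "##able"]
```

SentencePiece models such as ALBERT's don't fit this loading path because they use a different scheme (a unigram language model over raw text rather than greedy matching against a WordPiece vocabulary), which is why a separate loader is needed there.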