
About choice of Tokenizers

freddycct opened this issue 4 years ago · 5 comments

@chengchingwen What do we do about the Huggingface Bert Tokenizer? Is that in the plan?

freddycct avatar Aug 01 '20 08:08 freddycct

It's on the roadmap, but it won't be covered by the GSoC project. I'm considering wrapping the Rust implementation in huggingface/tokenizers with BinaryBuilder.jl.

chengchingwen avatar Aug 14 '20 14:08 chengchingwen

This project needs more love from the community. It looks like progress has stalled? @chengchingwen

freddycct avatar Feb 16 '21 20:02 freddycct

Yes. I'd love to see more people contributing to this project. I'm currently quite busy (working and studying), so I can't spend much effort on it. I do have some unreleased code snippets for the tokenizer, but most of them are not well tested. I'll try to squeeze out some time to release them (probably in 1-2 weeks).

Some updates/thoughts about tokenizers:

  1. The Rust implementation doesn't have a stable API, let alone a public C API for making bindings. We would need to either come up with a C API ourselves or find some way to call Rust functions directly from Julia.
  2. The easiest way would actually be to use PyCall.jl to load huggingface/transformers or huggingface/tokenizers and use those tokenizers from Python, but I personally dislike this approach. I avoid Python like the plague (that's why there is a Pickle.jl).
  3. Ideally I would love to see a native Julia implementation, but then a lot of stuff would need to be reimplemented. Currently we only have some basic tokenizers that I ported from the original Python implementations (the WordPiece in src/bert, the Bpe from BytePairEncoding.jl) and various utilities from WordTokenizers.jl.
  4. Bindings are the desired approach, but the problem is that most tokenizers are implemented in a binding-unfriendly way: Cython/Python, or C++/Rust without a C API.
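For context on what a native Julia implementation (point 3) entails: the core of WordPiece is a greedy longest-match-first loop over subwords. Below is a minimal illustrative sketch, not the actual Transformers.jl code; the function name, keyword argument, and toy vocabulary are made up for the example.

```julia
# Greedy longest-match-first WordPiece tokenization (illustrative sketch,
# not the Transformers.jl implementation). Continuation pieces are
# prefixed with "##", as in BERT's vocabulary files.
function wordpiece(word::AbstractString, vocab::Set{String}; unk = "[UNK]")
    tokens = String[]
    start = firstindex(word)
    while start <= lastindex(word)
        stop = lastindex(word)
        piece = nothing
        # Try the longest candidate first, shrinking from the right.
        while stop >= start
            cand = word[start:stop]
            start > firstindex(word) && (cand = "##" * cand)
            if cand in vocab
                piece = cand
                break
            end
            stop = prevind(word, stop)
        end
        piece === nothing && return [unk]  # no subword matched at all
        push!(tokens, piece)
        start = nextind(word, stop)
    end
    return tokens
end

# Toy vocabulary, made up for this example.
vocab = Set(["un", "##aff", "##able"])
wordpiece("unaffable", vocab)  # ["un", "##aff", "##able"]
```

A production version would also need the BERT pre-tokenization steps (whitespace splitting, punctuation handling, max-characters-per-word cutoff), which is where most of the porting effort goes.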

chengchingwen avatar Feb 22 '21 15:02 chengchingwen

I implemented a word-piece tokenizer in native Julia, named BertWordPieceTokenizer, and registered it on JuliaHub.

Currently, it can load word-piece tokenizers, e.g. for BERT and RoBERTa, but it fails for SentencePiece tokenizers such as ALBERT's.

SeanLee97 avatar Mar 09 '22 07:03 SeanLee97

@SeanLee97 Actually, we already have a WordPiece tokenizer in Transformers.jl. See here and here.

chengchingwen avatar Mar 09 '22 08:03 chengchingwen