
Any plan to support BBPE?

Open MrRace opened this issue 4 years ago • 2 comments

Hi, sentencepiece currently supports BPE, unigram, char, and word. Do you plan to support Byte-Level BPE (BBPE)? Thanks a lot!

MrRace avatar Feb 02 '21 07:02 MrRace

We do not have a plan, but sentencepiece supports a byte-fallback feature (--byte_fallback=true in the training phase) where UNK characters are split into UTF-8 bytes. I guess we can obtain almost the same effect.
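To illustrate the idea, here is a minimal plain-Python sketch of what byte fallback does conceptually. It is not sentencepiece's implementation: the function names and the toy character vocabulary are hypothetical, and only the `<0xXX>` byte-piece format is meant to mirror how sentencepiece names its byte pieces.

```python
def encode_with_byte_fallback(text, vocab):
    """Keep characters found in `vocab`; split any other character into
    one <0xXX> piece per UTF-8 byte (toy character-level sketch)."""
    pieces = []
    for ch in text:
        if ch in vocab:
            pieces.append(ch)
        else:
            # Unknown character: fall back to its UTF-8 bytes.
            pieces.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return pieces


def decode_with_byte_fallback(pieces):
    """Invert the encoding above by reassembling the byte stream."""
    buf = bytearray()
    for piece in pieces:
        if len(piece) == 6 and piece.startswith("<0x") and piece.endswith(">"):
            buf.append(int(piece[3:5], 16))
        else:
            buf.extend(piece.encode("utf-8"))
    return buf.decode("utf-8")


toy_vocab = set("helo ")  # hypothetical vocabulary for the example
pieces = encode_with_byte_fallback("hello 世", toy_vocab)
print(pieces)
print(decode_with_byte_fallback(pieces))
```

Because every unknown character maps to known byte pieces, no input ends up as UNK, which is the main property BBPE also provides. The difference is that BBPE learns merges over bytes for all text, while byte fallback only uses bytes for characters outside the vocabulary.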

taku910 avatar Feb 11 '21 08:02 taku910

@MrRace There is a nice BBPE implementation available in fairseq (from Facebook, Google's competitor). Here is the link: https://github.com/asigalov61/fairseq/tree/master/examples/byte_level_bpe

Also, check out Texar. AFAIK it has similar functionality: https://github.com/asyml/texar

asigalov61 avatar Feb 19 '21 07:02 asigalov61