sentencepiece
Any plan to support BBPE?
Hi, sentencepiece currently supports BPE, unigram, char, and word models. Do you plan to support byte-level BPE (BBPE)? Thanks a lot!
We do not have a plan for this, but sentencepiece supports a byte-fallback feature (--byte_fallback=true in the training phase), where UNK characters are split into UTF-8 bytes. I guess we can obtain almost the same effect.
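To illustrate what byte fallback means, here is a minimal pure-Python sketch. It does not use sentencepiece and is not its implementation: the function names, the toy character-level vocabulary, and the sample text are all assumptions made up for this example. The only thing borrowed from sentencepiece is the `<0xXX>` notation it uses for byte pieces.

```python
def encode_with_byte_fallback(text, vocab):
    """Toy character-level tokenizer (hypothetical helper, not a sentencepiece API).

    Characters found in `vocab` are emitted as-is; any other character is
    replaced by one <0xXX> piece per UTF-8 byte instead of a single UNK.
    """
    pieces = []
    for ch in text:
        if ch in vocab:
            pieces.append(ch)
        else:
            pieces.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return pieces


def decode_with_byte_fallback(pieces):
    """Inverse of the toy encoder: byte pieces are merged back into UTF-8."""
    buf = bytearray()
    for p in pieces:
        if len(p) == 6 and p.startswith("<0x") and p.endswith(">"):
            buf.append(int(p[3:5], 16))   # byte piece -> raw byte
        else:
            buf.extend(p.encode("utf-8"))  # ordinary piece
    return buf.decode("utf-8")


vocab = set("helo ")  # assumed toy vocabulary; "é" is deliberately missing
pieces = encode_with_byte_fallback("hello é", vocab)
print(pieces)
# ['h', 'e', 'l', 'l', 'o', ' ', '<0xC3>', '<0xA9>']
print(decode_with_byte_fallback(pieces))
# hello é
```

Because every byte value has its own piece, no input is lost to UNK and decoding is lossless. That is why the effect is close to byte-level BPE, even though the main vocabulary is still built over characters rather than bytes.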
@MrRace There is a nice BBPE implementation available in fairseq (from Facebook). Here is the link: https://github.com/asigalov61/fairseq/tree/master/examples/byte_level_bpe
Also, check out Texar. AFAIK it has similar functionality: https://github.com/asyml/texar