tokenizers icon indicating copy to clipboard operation
tokenizers copied to clipboard

ByteLevelBPETokenizer output seems weird

Open seyyaw opened this issue 5 years ago • 2 comments

I use the ByteLevelBPETokenizer to train a custom tokenizer for Amharic language (less-resource language).

tokenizer = ByteLevelBPETokenizer(lowercase=False)

tokenizer.train(files=paths, vocab_size=32000, min_frequency=3, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])

The merge.txt and vocab.json files I obtained are now not human readable.

áĪ ħ
áĪ °
ĠáĬ¥ áĬķ
ĠáĬ ¨
Ġáĭ Ń
áĬ Ń
ĠáĪ Ī
áį į

Also the encoding results in the same unreadable output

output = tokenizer.encode("አበበ በሶ በላ። ጫላ ጩቤ ጨበጠ፡፡")
print(output.ids, output.tokens, output.offsets)
>>>[0, 319, 5739, 2883, 4037, 303, 1631, 299, 5173, 506, 748, 11918, 363, 2] ['<s>', 'áĬł', 'áīłáīł', 'Ġáīłáζ', 'ĠáīłáĪĭ', 'áį¢', 'ĠáĮ«', 'áĪĭ', 'ĠáĮ©', 'áī¤', 'ĠáĮ¨', 'áīłáĮł', 'áį¡áį¡', '</s>'] [(0, 0), (0, 3), (3, 9), (9, 16), (16, 23), (23, 26), (26, 30), (30, 33), (33, 37), (37, 40), (40, 44), (44, 50), (50, 56), (0, 0)]

Is this the expected behavior? I will later use this to train a RoberTa model using the run_language_modeling.py script.

Thanks

seyyaw avatar Mar 24 '20 06:03 seyyaw

I also have a similar issue with Persian texts.

taesiri avatar Mar 27 '20 15:03 taesiri

Hey @seyyaw, @taesiri.

TLDR; This is how the byte-level BPE works. Main advantages are:

  • Smaller vocabularies
  • No unknown token

This is totally expected behavior. The byte-level BPE converts all the Unicode code points into multiple byte-level characters:

  1. Each Unicode code point is decomposed into bytes (1 byte for ASCII characters, and up to 4 bytes for UTF-8 Unicode code points)
  2. Each byte value gets a "visible" character assigned to it from the beginning of the Unicode table. This is especially important because there are a lot of control characters, so we can't just have a simple mapping ASCII Table character <-> byte value. So some characters get other representations, like for example the white space U+0020 becomes Ġ.

The purpose is, by doing so, you end up with an initial alphabet of 256 tokens. These 256 tokens can then be merged together to represent any other token in the vocabulary. This results in smaller vocabularies, that won't ever need an "unknown" token.

n1t0 avatar Mar 27 '20 16:03 n1t0