encode bytes directly
Is there a way to directly encode bytes with a BPE-based HF tokenizer, without having to decode them to a string first?
Hello,
yes, there is! You need a mapping from bytes to Unicode characters, and to skip any pre-tokenization and normalization steps.
from tokenizers import Tokenizer

def bytes_to_unicode() -> dict[int, str]:
    """Converts byte values to Unicode characters for byte-level tokenization."""
    # Printable bytes (no whitespace/control characters) keep their own code point.
    input_ids = list(range(33, 127)) + list(range(161, 173)) + list(range(174, 256))
    output_ids = input_ids[:]
    n = 0
    for char_ord in range(256):
        if char_ord not in input_ids:
            # Remaining bytes are shifted past 255 so every byte maps to a printable character.
            input_ids.append(char_ord)
            output_ids.append(256 + n)
            n += 1
    output_chars = [chr(c) for c in output_ids]
    return dict(zip(input_ids, output_chars, strict=True))

mapping = bytes_to_unicode()
tokenizer = Tokenizer.from_pretrained("gpt2")
tokenizer.normalizer = None
tokenizer.pre_tokenizer = None

byte_string = bytes([65, 66, 233])  # b"AB\xe9" -- not valid UTF-8, so it can't be decoded first
pseudo_string = "".join(mapping[b] for b in byte_string)  # "ABé"
encoded = tokenizer.encode(pseudo_string, add_special_tokens=False)
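To go back the other way, you can invert the same table and rebuild the bytes from the token strings (the GPT-2 vocab stores exactly these byte-level symbols, so joining the tokens yields the pseudo-string again). A minimal round-trip sketch, reusing mapping, encoded, and byte_string from above; the names unicode_to_bytes and recovered are just for illustration:

# Invert the byte -> character table to recover raw bytes from the token strings.
unicode_to_bytes = {char: byte for byte, char in mapping.items()}
recovered = bytes(unicode_to_bytes[ch] for ch in "".join(encoded.tokens))
assert recovered == byte_string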
Hope this helps!
In transformers we use this:
def bytes_to_unicode():
    """
    Returns a mapping between utf-8 bytes and unicode strings. We specifically avoid mapping to whitespace/control
    characters that the bpe code barfs on.

    The reversible bpe codes work on unicode strings. This means you need a large # of unicode characters in your
    vocab if you want to avoid UNKs. When you're at something like a 10B token dataset you end up needing around
    5K for decent coverage. This is a significant percentage of your normal, say, 32K bpe vocab. To avoid that,
    we want lookup tables between utf-8 bytes and unicode strings.
    """
    bs = (
        list(range(ord("!"), ord("~") + 1)) + list(range(ord("¡"), ord("¬") + 1)) + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(2**8):
        if b not in bs:
            bs.append(b)
            cs.append(2**8 + n)
            n += 1
    cs = [chr(n) for n in cs]
    return dict(zip(bs, cs))
then:

byte_encoder = bytes_to_unicode()

def token_bytes_to_string(b):
    # Decode the raw bytes as latin-1 (so each byte becomes the character with the
    # same code point), then map each one through the byte-level table.
    return "".join([byte_encoder[ord(char)] for char in b.decode("latin-1")])