
encode bytes directly

Open tsengalb99 opened this issue 2 months ago • 2 comments

Is there a way to directly encode bytes with a BPE-based HF tokenizer, without having to decode them into a string first?

tsengalb99 avatar Oct 19 '25 03:10 tsengalb99

Hello,

yes there is! You need a mapping from bytes to characters, and to skip any pre-tokenization and normalization steps.

from tokenizers import Tokenizer

def bytes_to_unicode() -> dict[int, str]:
    """Converts byte values to Unicode characters for byte-level tokenization."""
    input_ids = list(range(33, 127)) + list(range(161, 173)) + list(range(174, 256))
    output_ids = input_ids[:]
    n = 0
    for char_ord in range(256):
        if char_ord not in input_ids:
            input_ids.append(char_ord)
            output_ids.append(256 + n)
            n += 1
    output_chars = [chr(c) for c in output_ids]
    return dict(zip(input_ids, output_chars, strict=True))

mapping = bytes_to_unicode()
tokenizer = Tokenizer.from_pretrained("gpt2")

# Drop normalization and pre-tokenization so the BPE model sees the mapped
# characters exactly as we produce them.
tokenizer.normalizer = None
tokenizer.pre_tokenizer = None

# Map each byte to its stand-in character, then encode the resulting string.
byte_string = [65, 66, 233]
pseudo_string = "".join(mapping[b] for b in byte_string)

encoded = tokenizer.encode(pseudo_string, add_special_tokens=False)
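
To go back from the encoding to the original bytes, here is a minimal sketch (assuming the same `mapping`, `tokenizer`, and `encoded` as above, and that the vocab stores tokens in the mapped character space, as GPT-2's does):

# Invert the byte-to-character table and map the joined tokens back to bytes.
inverse_mapping = {char: byte for byte, char in mapping.items()}
recovered = bytes(inverse_mapping[char] for char in "".join(encoded.tokens))
assert list(recovered) == byte_string  # [65, 66, 233]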

Hope this helps!

stephantul avatar Oct 29 '25 09:10 stephantul

In transformers we use this:

def bytes_to_unicode():
    """
    Returns a mapping from utf-8 bytes to unicode strings. We specifically avoid mapping to whitespace/control
    characters that the bpe code barfs on.

    The reversible bpe codes work on unicode strings. This means you need a large # of unicode characters in your vocab
    if you want to avoid UNKs. When you're at something like a 10B token dataset you end up needing around 5K for
    decent coverage. This is a significant percentage of your normal, say, 32K bpe vocab. To avoid that, we want lookup
    tables between utf-8 bytes and unicode strings.
    """
    bs = (
        list(range(ord("!"), ord("~") + 1)) + list(range(ord("¡"), ord("¬") + 1)) + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(2**8):
        if b not in bs:
            bs.append(b)
            cs.append(2**8 + n)
            n += 1
    cs = [chr(n) for n in cs]
    return dict(zip(bs, cs))

then, with `byte_encoder = bytes_to_unicode()`:

def token_bytes_to_string(b):
    return "".join([byte_encoder[ord(char)] for char in b.decode("latin-1")])

ArthurZucker avatar Nov 28 '25 07:11 ArthurZucker