
Cannot tokenize byte sequences that are not valid UTF-8 due to design flaw

Open sharpobject opened this issue 1 year ago • 1 comments

Hello,

The BPE algorithm can tokenize any byte sequence, and LLMs generally accept any sequence of tokens, using token dictionaries that can represent any byte sequence. However, the encode method in tokenizers accepts a type that must be valid UTF-8. So there are lots of byte sequences, many of which are only one byte long, that you cannot tokenize using this library.
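A quick way to see the constraint (editorial illustration in plain Python, no tokenizers involved): the single byte `0x80` is not valid UTF-8 on its own, so it cannot even be represented as the `str` type that a string-only `encode` API requires.

```python
# 0x80 is a UTF-8 continuation byte; on its own it is not valid UTF-8,
# so it has no str representation to hand to a str-only encode() API.
raw = b"\x80"
try:
    raw.decode("utf-8")
    decodable = True
except UnicodeDecodeError:
    decodable = False
print(decodable)  # False
```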

sharpobject avatar Mar 08 '25 21:03 sharpobject

Does ByteLevel not work for your use-case?

MeetThePatel avatar Mar 19 '25 05:03 MeetThePatel

Yeah, I am not sure I understand: this library can take byte sequences; we usually convert bytes to a corresponding UTF-8 string representation, as mentioned by @MeetThePatel. https://github.com/huggingface/tokenizers/blob/backtrack/tokenizers/src/pre_tokenizers/byte_level.rs#L116-L116 explains why the vocab is in UTF-8. Closing, as it is not an issue.

ArthurZucker avatar Jun 19 '25 02:06 ArthurZucker

Sorry, what's the correct way to use the python bindings to use an existing vocab to encode byte-sequences?

For example, the below does not work:

```
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
text = b"\x80"
tokens = tokenizer.encode(text)  # does not work: encode expects a str, not bytes
```

sharpobject avatar Jun 21 '25 06:06 sharpobject

You have different ways of doing this, but:

  1. Do you have raw bytes as inputs? Or do you have some bytes somewhere in your inputs that are otherwise a string? For that you need byte_fallback, which is not supported by gpt2.
  2. If you want to get an encoding for bytes, you need to convert them to their string representation. The bytelevel pre-tokenizer does that for strings; here you need to convert them yourself:

```
value = int.from_bytes(b, byteorder='big', signed=False)
# value == 128 for b == b"\x80"
```

then you take the byte-to-unicode table from
https://github.com/openai/gpt-2/blob/master/src/encoder.py#L9
and you know this is `'Ģ'`.

```
In [37]: tokenizer.encode('Ģ')
Out[37]: [222]

In [38]: tokenizer.decode(222)
Out[38]: '�'
```
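For reference, the table being linked can be reproduced in a few lines of plain Python. This is an editorial sketch of the `bytes_to_unicode` construction from the linked openai/gpt-2 `encoder.py`: printable/latin bytes map to themselves, and the remaining bytes are shifted past U+0100.

```python
def bytes_to_unicode():
    # Bytes that map to themselves: printable ASCII plus two latin-1 ranges.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Everything else is shifted to an unused codepoint above 0xFF.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

mapping = bytes_to_unicode()
print(mapping[128])  # 'Ģ' — the string form of byte 0x80
print(mapping[32])   # 'Ġ' — the familiar GPT-2 space marker
```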

ArthurZucker avatar Jun 24 '25 12:06 ArthurZucker

I don't think so

```
>>> from transformers import GPT2Tokenizer
>>> tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
>>> tokenizer.encode('Ģ')
...
[128, 95]
```

sharpobject avatar Jun 24 '25 16:06 sharpobject

I mean, I understand that I can use a separate lookup table of individual byte-values to tokens representing single bytes. But BPE tokenizers generally work by applying pre-tokenization to divide a byte sequence into pre-tokens, and then applying merges within these pre-tokens to build up tokens, in the order specified in the vocab. That is the procedure that I would like to run, on some byte sequences which are sometimes not valid UTF-8. I'm not aware of any results in automata theory or whatever that have revealed that it is not possible to use regular expressions on sequences of bytes.
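The procedure described above (pre-tokenize, then apply merges in priority order within each pre-token) has no inherent dependence on UTF-8. A minimal editorial sketch over raw bytes, with a made-up `merges` list, just to show the merge loop itself works on arbitrary byte values:

```python
def bpe(pre_token, merges):
    # pre_token: a bytes object; merges: list of (left, right) pairs,
    # earlier entries having higher priority (lower rank).
    ranks = {pair: i for i, pair in enumerate(merges)}
    parts = [bytes([b]) for b in pre_token]
    while True:
        # Collect every adjacent pair that has a known merge rank.
        candidates = [(ranks[(parts[i], parts[i + 1])], i)
                      for i in range(len(parts) - 1)
                      if (parts[i], parts[i + 1]) in ranks]
        if not candidates:
            break
        _, i = min(candidates)  # apply the highest-priority merge
        parts[i:i + 2] = [parts[i] + parts[i + 1]]
    return parts

# Hypothetical merge table; 0x80 is not valid UTF-8, yet merges still apply.
merges = [(b"a", b"b"), (b"ab", b"\x80")]
print(bpe(b"ab\x80", merges))  # [b'ab\x80']
```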

sharpobject avatar Jun 24 '25 17:06 sharpobject

It's not that it's not possible; it's just that the entire library revolves around string inputs. Now https://github.com/huggingface/tokenizers/blob/main/tokenizers/src/decoders/byte_fallback.rs#L70 might be related, as I mentioned.

What you are asking for is a function / way to just pass raw bytes instead of strings. But strings are just matched to a corresponding ID in the vocab. The GPT2 vocab in tokenizers is not bytes; it is a string representation of the bytes. This allows much simpler visualization, and is why you have to convert your bytes first.

Now, you can remove the pre-tokenizer:

```
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
tok._tokenizer.pre_tokenizer = None
```

Also, you are using GPT2Tokenizer from transformers. Are you aware that it is completely unrelated to tokenizers? GPT2Tokenizer is pure Python, so you can change it however you want.

GPT2TokenizerFast is the one that I used, and it is the one that relies on the tokenizers library.

ArthurZucker avatar Jun 25 '25 11:06 ArthurZucker