
Access UTF-8 byte sequence for each token

Open DanielHesslow opened this issue 1 year ago • 2 comments

Hi,

It would be great if it were possible to get the UTF-8 byte sequence corresponding to each token ID. Since tokenizers returns strings, tokens that are not valid Unicode on their own decode to the replacement character �.

This makes streaming and constrained generation, for example, much more difficult and error-prone than they need to be.

Additionally, if we can get the UTF-8 byte sequence, decoding also gets much easier and faster, as it is simply a matter of concatenating the corresponding bytes.
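For illustration, here is a minimal sketch of the problem, assuming the real `gpt2` checkpoint; the exact token split is illustrative, not guaranteed:

```python
from tokenizers import Tokenizer

tok = Tokenizer.from_pretrained("gpt2")

# A 4-byte UTF-8 character is typically split across several byte-level tokens.
ids = tok.encode("🤗").ids

# Decoding each token on its own yields the replacement character, because no
# single token carries a complete UTF-8 sequence.
for i in ids:
    print(repr(tok.decode([i])))  # e.g. '�' for each partial-byte token

# Decoding the whole sequence at once is fine.
print(tok.decode(ids))  # '🤗'
```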

Cheers,

DanielHesslow avatar Sep 09 '24 12:09 DanielHesslow

Hey,

I ran into this issue, and wrote a blog post about it: https://stephantul.github.io/python/tokenizers/2023/03/16/bpe/

You can't take the byte representation of a token directly from the vocabulary. Basically, you have to use a specific character map to remap each character of the token string back to a raw byte, and then decode those bytes. If you do this, you can just keep concatenating the bytes and decoding them.
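As a concrete sketch, assuming a GPT-2-style byte-level BPE tokenizer (the character map below is the one GPT-2 uses; other tokenizer configurations may differ, as the next comment points out):

```python
from tokenizers import Tokenizer

def bytes_to_unicode():
    # GPT-2's mapping from raw bytes to printable Unicode characters:
    # printable bytes map to themselves, the rest are shifted above U+0100.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

# Invert the map: vocabulary character -> raw byte value.
char_to_byte = {c: b for b, c in bytes_to_unicode().items()}

def token_bytes(tok: Tokenizer, token_id: int) -> bytes:
    return bytes(char_to_byte[c] for c in tok.id_to_token(token_id))

tok = Tokenizer.from_pretrained("gpt2")
ids = tok.encode("🤗").ids
# Concatenating the per-token bytes and decoding once at the end recovers the text.
print(b"".join(token_bytes(tok, i) for i in ids).decode("utf-8"))  # '🤗'
```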

I hope this helps!

stephantul avatar Sep 24 '24 11:09 stephantul

Unfortunately this remapping is not correct for all tokenizers, and there isn't a single mapping that works everywhere. Doing it correctly requires handling each internal decoder separately. That is entirely possible, but it is error-prone and liable to break whenever the library changes. It really needs to be part of the library.
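In the absence of direct byte access, one decoder-agnostic workaround for streaming is to re-decode the accumulated IDs and hold back a trailing replacement character until more tokens arrive. A hedged sketch; `stream_decode` and the heuristic itself are illustrative, and re-decoding from scratch is quadratic over the stream, which is part of the point above:

```python
from tokenizers import Tokenizer

def stream_decode(tok: Tokenizer, id_stream):
    ids, emitted = [], 0
    for i in id_stream:
        ids.append(i)
        text = tok.decode(ids)
        # Hold back a trailing U+FFFD: it may be a partial UTF-8 sequence that
        # the next token will complete. It may also be a genuine �, which this
        # heuristic cannot distinguish -- exactly the error-proneness at issue.
        stable = len(text) - 1 if text.endswith("\ufffd") else len(text)
        if stable > emitted:
            yield text[emitted:stable]
            emitted = stable
    # Flush whatever remains once the stream ends.
    text = tok.decode(ids)
    if len(text) > emitted:
        yield text[emitted:]

tok = Tokenizer.from_pretrained("gpt2")
print("".join(stream_decode(tok, tok.encode("café 🤗").ids)))  # 'café 🤗'
```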

DanielHesslow avatar Sep 29 '24 15:09 DanielHesslow