
Access UTF-8 byte sequence for each token

Open DanielHesslow opened this issue 1 year ago • 2 comments

Hi,

It would be great if it were possible to get the UTF-8 byte sequence corresponding to each token ID. Since tokenizers returns strings, tokens that are not valid Unicode on their own decode to the replacement character �.

This makes streaming and constrained generation, for example, much more difficult and error-prone than they need to be.

Additionally, if we can get the UTF-8 byte sequence, decoding also gets much easier and faster, as it is simply a matter of concatenating the corresponding bytes.
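For illustration, here is a minimal sketch of the problem, assuming the real `gpt2` checkpoint; the exact token split is illustrative, not guaranteed:

```python
from tokenizers import Tokenizer

tok = Tokenizer.from_pretrained("gpt2")

# A 4-byte UTF-8 character is typically split across several byte-level tokens.
ids = tok.encode("🤗").ids

# Decoding each token on its own yields the replacement character, because no
# single token carries a complete UTF-8 sequence.
for i in ids:
    print(repr(tok.decode([i])))  # e.g. '�' for each partial-byte token

# Decoding the whole sequence at once is fine.
print(tok.decode(ids))  # '🤗'
```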

Cheers,

DanielHesslow avatar Sep 09 '24 12:09 DanielHesslow

Hey,

I ran into this issue, and wrote a blog post about it: https://stephantul.github.io/python/tokenizers/2023/03/16/bpe/

You can't take the byte representation of a token directly from the vocabulary. Basically, you have to use a specific character map to remap each character of the token string back to a raw byte, and then decode those bytes. If you do this, you can just keep concatenating the bytes and decoding them.
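As a concrete sketch, assuming a GPT-2-style byte-level BPE tokenizer (the character map below is the one GPT-2 uses; other tokenizer configurations may differ, as the next comment points out):

```python
from tokenizers import Tokenizer

def bytes_to_unicode():
    # GPT-2's mapping from raw bytes to printable Unicode characters:
    # printable bytes map to themselves, the rest are shifted above U+0100.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

# Invert the map: vocabulary character -> raw byte value.
char_to_byte = {c: b for b, c in bytes_to_unicode().items()}

def token_bytes(tok: Tokenizer, token_id: int) -> bytes:
    return bytes(char_to_byte[c] for c in tok.id_to_token(token_id))

tok = Tokenizer.from_pretrained("gpt2")
ids = tok.encode("🤗").ids
# Concatenating the per-token bytes and decoding once at the end recovers the text.
print(b"".join(token_bytes(tok, i) for i in ids).decode("utf-8"))  # '🤗'
```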

I hope this helps!

stephantul avatar Sep 24 '24 11:09 stephantul

Unfortunately this remapping is not correct for all tokenizers, and there isn't a single mapping that works everywhere. Doing it correctly requires handling each internal decoder separately. That is entirely possible, but it is error-prone and liable to break whenever the library changes. It really needs to be part of the library.
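In the absence of direct byte access, one decoder-agnostic workaround for streaming is to re-decode the accumulated IDs and hold back a trailing replacement character until more tokens arrive. A hedged sketch; `stream_decode` and the heuristic itself are illustrative, and re-decoding from scratch is quadratic over the stream, which is part of the point above:

```python
from tokenizers import Tokenizer

def stream_decode(tok: Tokenizer, id_stream):
    ids, emitted = [], 0
    for i in id_stream:
        ids.append(i)
        text = tok.decode(ids)
        # Hold back a trailing U+FFFD: it may be a partial UTF-8 sequence that
        # the next token will complete. It may also be a genuine �, which this
        # heuristic cannot distinguish -- exactly the error-proneness at issue.
        stable = len(text) - 1 if text.endswith("\ufffd") else len(text)
        if stable > emitted:
            yield text[emitted:stable]
            emitted = stable
    # Flush whatever remains once the stream ends.
    text = tok.decode(ids)
    if len(text) > emitted:
        yield text[emitted:]

tok = Tokenizer.from_pretrained("gpt2")
print("".join(stream_decode(tok, tok.encode("café 🤗").ids)))  # 'café 🤗'
```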

DanielHesslow avatar Sep 29 '24 15:09 DanielHesslow