How to find the correct (token_id, byte_val) relationship for the llama3 tokenizer?

Open bugm opened this issue 1 year ago • 0 comments

Hello all. As far as I know, the llama3 tokenizer is based on byte-level BPE, but I cannot find the relationship between token ids and the 0-255 byte values. For example, the character "Ä" is encoded in UTF-8 as b'\xc3\x84' = [195, 132]. With the llama3 tokenizer, "Ä" is encoded as token 88075. Checking the vocab and merges, I found that 88075 is "ÃĦ", merged from "Ã" (token index 127) and "Ħ" (token index 226), but these indices do not match the UTF-8 byte values 195 and 132. Is there any documentation that explains how the 256 byte-level token ids map to byte values? For example, how are token ids 127 and 226 converted to the byte values 195 and 132 (b'\xc3\x84'), which then decode as UTF-8 to the character "Ä"?
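For reference, here is a minimal sketch of one possible explanation. It assumes the llama3 vocab uses the same byte-to-printable-character table as GPT-2's byte-level BPE (the `bytes_to_unicode` helper in GPT-2's encoder). That is an assumption, not something I found in llama3 documentation, but it is consistent with the numbers above. The names `byte_order`, `char_to_byte`, and `token_str_to_bytes` are made up for illustration.

```python
def bytes_to_unicode():
    """Rebuild the GPT-2 style table: each byte 0-255 gets a printable character."""
    # Bytes that are already printable keep their own code point.
    bs = (
        list(range(ord("!"), ord("~") + 1))    # 33..126
        + list(range(ord("¡"), ord("¬") + 1))  # 161..172
        + list(range(ord("®"), ord("ÿ") + 1))  # 174..255
    )
    cs = bs[:]
    n = 0
    # Remaining bytes (control chars, space, etc.) are shifted to 256, 257, ...
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return bs, [chr(c) for c in cs]


byte_order, chars = bytes_to_unicode()
byte_to_char = dict(zip(byte_order, chars))
char_to_byte = {c: b for b, c in byte_to_char.items()}


def token_str_to_bytes(token: str) -> bytes:
    """Map a vocab string such as "ÃĦ" back to the raw bytes it stands for."""
    return bytes(char_to_byte[ch] for ch in token)


raw = token_str_to_bytes("ÃĦ")
print(list(raw))            # [195, 132]
print(raw.decode("utf-8"))  # Ä

# Position of each byte in the table's ordering (assumed to be the token id
# of the corresponding single-byte token).
print(byte_order.index(195), byte_order.index(132))  # 127 226
```

If this assumption holds, the id of a single-byte token is the position of that byte in the table's ordering, not the byte value itself, which would explain why 127 and 226 differ from 195 and 132.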

Oct 24 '24 10:10 bugm