
💡 [REQUEST] - <support fast tokenization>


起始日期 | Start Date

2023-12-06

实现PR | Implementation PR

Qwen TokenizerFast

相关Issues | Reference Issues

No response

摘要 | Summary

New feature: fast tokenization support for Qwen.

基本示例 | Basic Example

Qwen PreTrainedTokenizerFast

缺陷 | Drawbacks

None

未解决问题 | Unresolved questions

Support for return_offsets_mapping=True.
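
For concreteness, this is the kind of call the request is about (a minimal sketch; the checkpoint name is illustrative, and the call currently fails because Qwen only ships a slow tokenizer):

```python
# Minimal sketch of the requested usage. "Qwen/Qwen-7B" is illustrative;
# with the current tiktoken-based slow tokenizer this raises
# NotImplementedError, since offset mapping requires a PreTrainedTokenizerFast.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
encoding = tokenizer("hello world", return_offsets_mapping=True)
print(encoding["offset_mapping"])
```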

windar427 avatar Dec 06 '23 06:12 windar427

QwenTokenizer is based on tiktoken. It is registered as a "slow" tokenizer in transformers, but in practice it is faster than the Rust-based "fast" tokenizers. A rough timing comparison is sketched below.
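
(A rough timing sketch; the checkpoint names and corpus are only placeholders, and comparing different vocabularies is not a rigorous benchmark.)

```python
# Rough timing sketch (assumption: both checkpoints are available locally).
# Compares the tiktoken-based Qwen tokenizer against a Rust "fast" tokenizer.
import time
from transformers import AutoTokenizer

qwen_tok = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
fast_tok = AutoTokenizer.from_pretrained("gpt2", use_fast=True)  # placeholder fast tokenizer

text = "千里之行，始于足下。" * 200

def bench(tok, n=50):
    start = time.perf_counter()
    for _ in range(n):
        tok(text)
    return time.perf_counter() - start

print("tiktoken-based:", bench(qwen_tok))
print("fast (Rust):   ", bench(fast_tok))
```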

If you only need the offsets mapping, the problem is that QwenTokenizer applies BPE to UTF-8 encoded byte sequences (not Unicode strings), so it is not clear how the offsets mapping should be interpreted. The Hugging Face documentation only says:

return_offsets_mapping (bool, optional, defaults to False) — Whether or not to return (char_start, char_end) for each token.

But QwenTokenizer is not based on chars. For example, " 司马" with a leading space (司 is \xe5\x8f\xb8 and 马 is \xe9\xa9\xac when encoded in UTF-8) is tokenized as [b' \xe5\x8f', b'\xb8', b'\xe9\xa9\xac'].

On the byte level it is quite clear: the offset mapping is [(0, 3), (3, 4), (4, 7)], but I doubt that is what you want. On the char level it could be [(0, 2), (1, 2), (2, 3)], if a token is taken to cover every character whose bytes it overlaps, but that can be confusing as well.
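
To make the two interpretations concrete, here is a small sketch using tiktoken directly. The "cl100k_base" encoding is only a stand-in so the snippet runs without the Qwen vocabulary, so the exact token split it prints may differ from the Qwen example above; the comments describe the Qwen split.

```python
# Sketch of byte-level vs. char-level offset mappings for a byte-level BPE.
# Assumption: "cl100k_base" is used as a stand-in encoding; Qwen's own
# vocabulary would be needed to reproduce the exact split discussed above.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = " 司马"
token_ids = enc.encode(text)
token_bytes = [enc.decode_single_token_bytes(t) for t in token_ids]

# Byte-level offsets: each token covers a contiguous span of UTF-8 bytes.
byte_offsets, pos = [], 0
for tb in token_bytes:
    byte_offsets.append((pos, pos + len(tb)))
    pos += len(tb)

# Char-level offsets: a token covers every character whose bytes it overlaps.
# First build a map from byte index to character index.
byte_to_char = []
for char_idx, ch in enumerate(text):
    byte_to_char.extend([char_idx] * len(ch.encode("utf-8")))

char_offsets = [
    (byte_to_char[start], byte_to_char[end - 1] + 1)
    for start, end in byte_offsets
]

print(token_bytes)   # with Qwen's BPE the split is [b' \xe5\x8f', b'\xb8', b'\xe9\xa9\xac']
print(byte_offsets)  # byte-level: [(0, 3), (3, 4), (4, 7)] for that split
print(char_offsets)  # char-level: [(0, 2), (1, 2), (2, 3)] for that split
```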

I think it would be better if you decided which of these interpretations is the best option for your scenario.

jklj077 avatar Dec 06 '23 08:12 jklj077

well THX

windar427 avatar Dec 07 '23 03:12 windar427