💡 [REQUEST] - <support fast tokenization>
Start Date
2023-12-06
Implementation PR
Qwen TokenizerFast
Reference Issues
No response
Summary
New feature: fast tokenization support.
Basic Example
A `PreTrainedTokenizerFast` for Qwen.
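For illustration, a minimal sketch of how the requested feature could be used, assuming a `PreTrainedTokenizerFast` (i.e. a `tokenizer.json`-backed tokenizer) were available for Qwen; the model ID and sample text are placeholders:

```python
# Hypothetical usage of the requested fast tokenizer; "Qwen/Qwen-7B" and the
# sample sentence are placeholders, and no tokenizer.json ships with Qwen today.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", use_fast=True, trust_remote_code=True)

enc = tok("司马光砸缸", return_offsets_mapping=True)  # offsets require a fast tokenizer
print(enc["input_ids"])
print(enc["offset_mapping"])  # list of (char_start, char_end) per token
```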
Drawbacks
None
Unresolved questions
`return_offsets_mapping=True`
QwenTokenizer is based on tiktoken; although it is registered as a "slow" tokenizer, it is actually faster than the typical "fast" tokenizers.
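As a rough, hypothetical way to check that claim, one can time both kinds of tokenizer on the same text; the model IDs below are only examples, and absolute numbers depend on hardware and input length:

```python
# Rough timing sketch; "Qwen/Qwen-7B" (tiktoken-based slow tokenizer) and "gpt2"
# (Rust-backed fast tokenizer) are illustrative choices, not a rigorous benchmark.
import time
from transformers import AutoTokenizer

qwen_tok = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
gpt2_fast = AutoTokenizer.from_pretrained("gpt2", use_fast=True)

text = "司马光砸缸。" * 1000
for name, tok in [("Qwen (tiktoken)", qwen_tok), ("gpt2 (fast)", gpt2_fast)]:
    start = time.perf_counter()
    for _ in range(20):
        tok(text)
    print(name, f"{time.perf_counter() - start:.3f}s")
```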
If you only need the offsets mapping, the problem is that QwenTokenizer applies BPE to UTF-8 encoded byte sequences (not Unicode strings), so it is not clear how the offsets mapping should be interpreted. The Hugging Face documentation only says:
return_offsets_mapping (bool, optional, defaults to False) — Whether or not to return (char_start, char_end) for each token.
But QwenTokenizer is not based on chars. For example, ` 司马` with a leading space (`\xe5\x8f\xb8\xe9\xa9\xac` is `司马` encoded in UTF-8; `司` is `\xe5\x8f\xb8` and `马` is `\xe9\xa9\xac`) is tokenized as `[b' \xe5\x8f', b'\xb8', b'\xe9\xa9\xac']`.
On the byte level it is quite clear: the offset mapping would be [(0, 3), (3, 4), (4, 7)], but I doubt that is what you want. On the char level it could be [(0, 2), (1, 2), (2, 3)], if a char counts as covered whenever the token includes any of its bytes, but that can be confusing as well.
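To make the two interpretations concrete, here is a sketch that reproduces the byte-level tokens and derives both mappings. It assumes the Qwen tokenizer exposes its underlying tiktoken `Encoding` as `tok.tokenizer` and that the returned IDs are plain tiktoken ranks; both are assumptions about the implementation:

```python
# Sketch of byte-level vs char-level offsets for the " 司马" example above.
# `decode_single_token_bytes` is tiktoken's API; `tok.tokenizer` being the tiktoken
# Encoding is an assumption about how the Qwen tokenizer is implemented.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)

text = " 司马"
ids = tok(text)["input_ids"]
token_bytes = [tok.tokenizer.decode_single_token_bytes(i) for i in ids]
print(token_bytes)  # expected: [b' \xe5\x8f', b'\xb8', b'\xe9\xa9\xac']

# Byte-level offsets: accumulate the byte length of each token.
byte_offsets, pos = [], 0
for b in token_bytes:
    byte_offsets.append((pos, pos + len(b)))
    pos += len(b)
print(byte_offsets)  # [(0, 3), (3, 4), (4, 7)]

# Char-level offsets: a char counts as covered if the token includes any of its bytes.
byte_to_char = []
for ci, ch in enumerate(text):
    byte_to_char.extend([ci] * len(ch.encode("utf-8")))
char_offsets = [(byte_to_char[s], byte_to_char[e - 1] + 1) for s, e in byte_offsets]
print(char_offsets)  # [(0, 2), (1, 2), (2, 3)]
```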
I think it would be better if you decided which interpretation is the best option for this kind of scenario.
Well, thanks.