
After merging vocabularies with merge_tokenizers, the fast tokenizer and the slow tokenizer load vocabularies of different sizes.

Open · enze5088 opened this issue 3 years ago · 1 comment

After merging the Chinese vocabulary with merge_tokenizers.py, the slow (non-fast) tokenizer loads a vocabulary of roughly 60,000 tokens, while the fast tokenizer loads only about 46,000. Is this a bug?
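A minimal way to reproduce the comparison (assuming the merged tokenizer was saved to a local directory; the name `merged_tokenizer_hf` is illustrative):

```python
from transformers import AutoTokenizer

# Directory produced by merge_tokenizers.py; the path is a placeholder.
MERGED_DIR = "merged_tokenizer_hf"

# Load the same merged files once as the slow (SentencePiece-backed)
# tokenizer and once as the fast (Rust-backed) tokenizer.
slow = AutoTokenizer.from_pretrained(MERGED_DIR, use_fast=False)
fast = AutoTokenizer.from_pretrained(MERGED_DIR, use_fast=True)

# len(tokenizer) counts the full vocabulary, including added tokens.
print("slow vocab size:", len(slow))  # ~60k after the Chinese merge
print("fast vocab size:", len(fast))  # only ~46k in this report
```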

enze5088 · Apr 27 '23

The merge_tokenizers code has not been tested with fast tokenizers. We suggest using the regular (slow) tokenizer.
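Following that advice, downstream code can force the slow tokenizer explicitly; a minimal sketch (the directory name is again a placeholder):

```python
from transformers import LlamaTokenizer

# Explicitly load the slow, SentencePiece-backed LLaMA tokenizer that
# merge_tokenizers.py was written against, avoiding the fast-tokenizer
# discrepancy reported above.
tokenizer = LlamaTokenizer.from_pretrained("merged_tokenizer_hf")
print(len(tokenizer))  # should report the full merged vocabulary (~60k)
```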

airaria · Apr 27 '23

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your consideration.

github-actions[bot] · May 05 '23