
Handling of Chinese vs. English datasets

Open wangguanhua opened this issue 3 years ago • 2 comments

Hello, I see that your datasets include both Chinese and English. But isn't tokenization different for the two languages? English uses WordPiece, while Chinese is split directly into characters, and I didn't find any handling of this in your code. Or have I misunderstood your code?

wangguanhua avatar Jul 18 '20 15:07 wangguanhua

"We use character-based tokenization for Chinese, and WordPiece tokenization for all other languages. Both models should work out-of-the-box without any code changes. We did update the implementation of BasicTokenizer in tokenization.py to support Chinese character tokenization, so please update if you forked it." The above is Google's official explanation of the BERT tokenizer; you can read the details yourself. In short, a single tokenizer lets BERT handle Chinese and English uniformly, so you don't need language-specific preprocessing.

LittleSJL avatar Nov 05 '20 03:11 LittleSJL

Thanks a lot!


wangguanhua avatar Nov 05 '20 06:11 wangguanhua