
Handling of Chinese vs. English datasets

Open wangguanhua opened this issue 3 years ago • 2 comments

Hello, I see that your datasets include both Chinese and English. But isn't tokenization different for the two languages? English uses WordPiece, while Chinese is split directly into characters, and I didn't find any handling of this in your code. Or have I misunderstood your code?

wangguanhua avatar Jul 18 '20 15:07 wangguanhua

"We use character-based tokenization for Chinese, and WordPiece tokenization for all other languages. Both models should work out-of-the-box without any code changes. We did update the implementation of BasicTokenizer in tokenization.py to support Chinese character tokenization, so please update if you forked it." The above is Google's official explanation of the BERT tokenizer; you can read the details yourself. In short, a single tokenizer lets BERT handle Chinese and English uniformly, so you don't need language-specific preprocessing.

LittleSJL avatar Nov 05 '20 03:11 LittleSJL

Thanks a lot!


wangguanhua avatar Nov 05 '20 06:11 wangguanhua