
How can I use this PLATO code to train a Chinese model?

Open · ShengXiaoXiao opened this issue 4 years ago · 1 comment

ShengXiaoXiao · Aug 13 '20 07:08

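Prepare your corpus by following the instructions in Knover/README.md ( https://github.com/PaddlePaddle/Knover/blob/master/README.md ). You can use the sentencepiece tool ( https://github.com/google/sentencepiece ) to generate the vocabulary; for the format, refer to ./package/dialog_en/vocab.txt and ./package/dialog_en/spm.model. Alternatively, you can use an existing Chinese vocabulary.

If you use a different tokenizer (not the sentencepiece tokenizer), modify ./utils/tokenization.py and implement your own tokenizer (say, BasicTokenizer) modeled on the SentencePieceTokenizer implementation: https://github.com/PaddlePaddle/Knover/blob/15d5279a4370b225b0c388a129b774c9469fcde4/utils/tokenization.py#L124 . Then specify the tokenizer in train_args in the config by adding the line train_args="--tokenizer BasicTokenizer".

For the training procedure and configuration details, also refer to Knover/README.md.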
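As a rough illustration of the custom-tokenizer route, here is a minimal, self-contained sketch of a character-level tokenizer for Chinese text. It is not taken from the Knover repository. The class name BasicTokenizer comes from the example above, but the constructor arguments, the method names (tokenize, convert_tokens_to_ids, convert_ids_to_tokens), the in-memory vocabulary format and the [UNK] token are all assumptions made for illustration. Check the SentencePieceTokenizer at the link above for the interface that Knover actually expects before adapting it.

```python
import unicodedata


class BasicTokenizer:
    """Hypothetical character-level tokenizer sketch (not Knover code)."""

    def __init__(self, vocab, unk_token="[UNK]"):
        # Assumed format: dict mapping token string -> integer id.
        # The unknown token must be present in the vocabulary.
        self.vocab = vocab
        self.inv_vocab = {idx: tok for tok, idx in vocab.items()}
        self.unk_token = unk_token

    @staticmethod
    def _is_cjk(ch):
        # Basic CJK Unified Ideographs block only; extend if needed.
        return 0x4E00 <= ord(ch) <= 0x9FFF

    @staticmethod
    def _is_punct(ch):
        # Unicode general categories starting with "P" are punctuation.
        return unicodedata.category(ch).startswith("P")

    def tokenize(self, text):
        """Split CJK characters and punctuation into single tokens;
        keep runs of other non-space characters together."""
        tokens, buf = [], []
        for ch in text:
            if ch.isspace():
                if buf:
                    tokens.append("".join(buf))
                    buf = []
            elif self._is_cjk(ch) or self._is_punct(ch):
                if buf:
                    tokens.append("".join(buf))
                    buf = []
                tokens.append(ch)
            else:
                buf.append(ch)
        if buf:
            tokens.append("".join(buf))
        return tokens

    def convert_tokens_to_ids(self, tokens):
        # Out-of-vocabulary tokens map to the id of the unknown token.
        unk_id = self.vocab[self.unk_token]
        return [self.vocab.get(tok, unk_id) for tok in tokens]

    def convert_ids_to_tokens(self, ids):
        return [self.inv_vocab[i] for i in ids]


if __name__ == "__main__":
    # Toy vocabulary for demonstration only.
    demo_vocab = {"[UNK]": 0, "你": 1, "好": 2, "world": 3, "!": 4}
    tokenizer = BasicTokenizer(demo_vocab)
    pieces = tokenizer.tokenize("你好 world!")
    print(pieces, tokenizer.convert_tokens_to_ids(pieces))
```

Splitting every Chinese character into its own token keeps the vocabulary small and avoids a separate word-segmentation step, at the cost of longer sequences. A subword vocabulary trained with sentencepiece is the alternative described above.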

sserdoubleh · Aug 13 '20 08:08