Chinese-LLaMA-Alpaca

Question about the general Chinese data used for second-stage pre-training

Open kleinchueng opened this issue 2 years ago • 1 comment

  • [ ] Base model: LLaMA / Alpaca / LLaMA-Plus / Alpaca-Plus
  • [ ] Operating system: Linux
  • [ ] Issue category: model training

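What are the sources of the general Chinese data, and is there a sample? Also, as with data/pt_data.txt, would it be enough to simply extract the Chinese instruction data from it?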

kleinchueng avatar May 20 '23 13:05 kleinchueng

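data/pt_data.txt is only an example. Pre-training data has nothing to do with instruction data. For example, you can use corpora such as Chinese Wikipedia, WuDaoCorpus, or the Chinese portion of the ROOTS dataset.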
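A minimal sketch of turning raw corpus documents into a plain-text pre-training file, to illustrate that the input is ordinary running text rather than instruction/response pairs. This is not the project's actual pipeline: the function names, the one-paragraph-per-line layout, and the `min_chars` / `min_ratio` thresholds are all assumptions made for illustration, and the raw documents would come from whichever corpus you download yourself.

```python
import re
from pathlib import Path

# Basic CJK Unified Ideographs block; punctuation is deliberately not counted.
_CJK = re.compile(r"[\u4e00-\u9fff]")


def chinese_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are CJK ideographs."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if _CJK.match(c)) / len(chars)


def build_pretraining_lines(documents, min_chars=10, min_ratio=0.5):
    """Split documents into paragraphs and keep the mostly-Chinese ones.

    min_chars and min_ratio are illustrative thresholds (assumptions),
    not values taken from the project.
    """
    lines = []
    for doc in documents:
        for para in doc.splitlines():
            para = " ".join(para.split())  # collapse runs of whitespace
            if len(para) >= min_chars and chinese_ratio(para) >= min_ratio:
                lines.append(para)
    return lines


def write_corpus(documents, out_path):
    """Write one kept paragraph per line as UTF-8; return the line count."""
    lines = build_pretraining_lines(documents)
    Path(out_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)
```

For example, `write_corpus(raw_docs, "my_pt_data.txt")` (a hypothetical file name) would produce a text file with one paragraph per line. No instruction templates or prompts are involved at this stage.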

airaria avatar May 21 '23 12:05 airaria

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your consideration.

github-actions[bot] avatar May 28 '23 22:05 github-actions[bot]