Chinese-LLaMA-Alpaca

Question: during the pretraining stage, how are eos and bos added to the training text?

Open ruanshudong opened this issue 2 years ago • 2 comments

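Who exactly does this? Are the tokens added during text preprocessing (bos at the start and eos at the end of each article), or does the code add them automatically? I have never quite understood this part. Thanks!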

ruanshudong avatar May 17 '23 06:05 ruanshudong

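My understanding is that bos/eos are added at the head and tail of each article during preprocessing, but what is [PAD] used for? It seems to be unused.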

ruanshudong avatar May 17 '23 08:05 ruanshudong

The tokenizer handles this automatically. The pretraining data does not need any special processing.

airaria avatar May 17 '23 11:05 airaria
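
Editorial sketch (not part of the original reply, and not the project's code): the snippet below illustrates what "the tokenizer handles this automatically" can look like. `ToyTokenizer` is a hypothetical stand-in with a made-up vocabulary. Its `add_bos_token` / `add_eos_token` flags are named after the options of the same name on Hugging Face's `LlamaTokenizer`, and the ids 1 and 2 are chosen for illustration. Which of these flags the repository's training script actually enables is not stated in this thread, so check the script before relying on it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyTokenizer:
    """Hypothetical stand-in for a real tokenizer (illustration only)."""

    bos_token_id: int = 1        # assumed id for the beginning-of-sequence token
    eos_token_id: int = 2        # assumed id for the end-of-sequence token
    add_bos_token: bool = True   # prepend BOS on every encode() call
    add_eos_token: bool = True   # append EOS on every encode() call

    def encode(self, text: str) -> List[int]:
        # Toy vocabulary: map each whitespace-separated word to an id >= 3,
        # so ordinary tokens never collide with the special-token ids.
        ids = [3 + (sum(word.encode("utf-8")) % 1000) for word in text.split()]
        if self.add_bos_token:
            ids = [self.bos_token_id] + ids
        if self.add_eos_token:
            ids = ids + [self.eos_token_id]
        return ids


# The raw documents contain no special markers at all.
raw_documents = ["first raw document", "second raw document"]

tokenizer = ToyTokenizer()
encoded = [tokenizer.encode(doc) for doc in raw_documents]

for doc, ids in zip(raw_documents, encoded):
    print(repr(doc), "->", ids)
```

The point of the sketch is that the special tokens are attached at encoding time, so the text files themselves stay plain.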

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your consideration.

github-actions[bot] avatar May 24 '23 22:05 github-actions[bot]

Closing the issue, since no updates observed. Feel free to re-open if you need any further assistance.

github-actions[bot] avatar May 28 '23 22:05 github-actions[bot]