Chinese-LLaMA-Alpaca

Problem with training data that is too large

Open yfq512 opened this issue 2 years ago • 9 comments

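This question comes from another project, but I hope to find a solution here. Thank you. When I load a 25 GB dataset with “load_dataset("json", data_files=DATA_PATH)”, I get the error “OverflowError: value too large to convert to int32_t”. How should I handle this?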

yfq512 avatar Apr 28 '23 07:04 yfq512

You can refer to our pre-training code, which loads multiple small files and merges them into a single dataset.
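A minimal sketch of that idea, not the project's actual script: it assumes the data has already been stored as several smaller JSON Lines files in one directory, that every file has the same fields, and that the Hugging Face `datasets` package is installed. The names `list_shards`, `load_and_merge`, and the `*.jsonl` pattern are hypothetical. Whether this avoids the OverflowError for a given dataset has not been verified here.

```python
from pathlib import Path


def list_shards(data_dir, pattern="*.jsonl"):
    """Return the shard files in data_dir that match pattern, in sorted order."""
    return sorted(str(p) for p in Path(data_dir).glob(pattern))


def load_and_merge(data_dir, pattern="*.jsonl"):
    """Load every shard as its own dataset, then concatenate them into one."""
    # Imported here so that list_shards can be used without the library installed.
    from datasets import concatenate_datasets, load_dataset

    shards = list_shards(data_dir, pattern)
    if not shards:
        raise FileNotFoundError(f"no files matching {pattern} in {data_dir}")

    # Each call only has to parse one small file.
    parts = [
        load_dataset("json", data_files=path, split="train")
        for path in shards
    ]
    # concatenate_datasets requires all parts to share the same features.
    return concatenate_datasets(parts)
```

Usage would look like `dataset = load_and_merge("data/shards")`, where `data/shards` is a placeholder path.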

iMountTai avatar May 04 '23 04:05 iMountTai

Have you solved this yet?

yuanzhiyong1999 avatar May 04 '23 07:05 yuanzhiyong1999

Not yet, I haven't had time to look into it.

yfq512 avatar May 04 '23 08:05 yfq512

Thanks.

yuanzhiyong1999 avatar May 04 '23 08:05 yuanzhiyong1999

I think following https://github.com/ymcui/Chinese-LLaMA-Alpaca/issues/222#issuecomment-1534078562 should work. Give it a try.

yfq512 avatar May 04 '23 08:05 yfq512

I still don't quite understand it...

yuanzhiyong1999 avatar May 04 '23 08:05 yuanzhiyong1999

https://github.com/ymcui/Chinese-LLaMA-Alpaca/blob/8a6d8dfa18b22055c857a97ca8f35f6612e3a66a/scripts/run_clm_pt_with_peft.py#LL493C17-L493C94 is probably the place to start: split the large file into smaller files, then write a loop that concatenates all the small files.
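The splitting step could look roughly like the sketch below. It assumes the large file is in JSON Lines format (one JSON record per line); a single top-level JSON array would need a different approach. The function name `split_jsonl`, the shard naming scheme, and the default shard size are made up for illustration.

```python
from pathlib import Path


def split_jsonl(src, out_dir, lines_per_shard=100_000):
    """Stream src line by line and write shards of at most lines_per_shard lines.

    Returns the list of shard paths in the order they were written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    shard_paths = []
    out = None
    try:
        with open(src, "r", encoding="utf-8") as f:
            for i, line in enumerate(f):
                # Start a new shard every lines_per_shard lines.
                if i % lines_per_shard == 0:
                    if out is not None:
                        out.close()
                    path = out_dir / f"shard_{len(shard_paths):05d}.jsonl"
                    out = open(path, "w", encoding="utf-8")
                    shard_paths.append(str(path))
                out.write(line)
    finally:
        if out is not None:
            out.close()
    return shard_paths
```

Because the source is read one line at a time, the whole 25 GB file never has to fit in memory. The resulting shards can then be loaded individually and concatenated in a loop.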

yfq512 avatar May 04 '23 08:05 yfq512

I'll give it a try.

yuanzhiyong1999 avatar May 04 '23 08:05 yuanzhiyong1999

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your consideration.

github-actions[bot] avatar May 13 '23 00:05 github-actions[bot]

Closing the issue, since no updates observed. Feel free to re-open if you need any further assistance.

github-actions[bot] avatar May 16 '23 22:05 github-actions[bot]