Chinese-LLaMA-Alpaca: Training data is too large to load

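This question comes from a different project, but I hope to find a solution here. Thanks. When I load my data (25 GB) with “load_dataset("json", data_files=DATA_PATH)”, I get the error “OverflowError: value too large to convert to int32_t”. How should I handle this?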

Apr 28 '23 07:04 yfq512

You can refer to our pre-training code, which loads multiple small files and merges them into a single dataset.

May 04 '23 04:05 iMountTai
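
The suggestion above (load several small files and merge them) could look roughly like the sketch below. It is not the project's pre-training code. It assumes the data has already been split into JSON Lines shards that share the same fields, and the helper names `list_shards` and `load_and_merge` as well as the `*.jsonl` pattern are made up for illustration. Only `load_dataset` and `concatenate_datasets` come from the Hugging Face `datasets` library.

```python
import glob
import os


def list_shards(shard_dir, pattern="*.jsonl"):
    """Return the shard paths in a stable, sorted order (hypothetical helper)."""
    return sorted(glob.glob(os.path.join(shard_dir, pattern)))


def load_and_merge(shard_paths):
    """Load each shard separately, then concatenate them into one dataset."""
    # Imported here so that listing shards does not require `datasets`.
    from datasets import concatenate_datasets, load_dataset

    parts = []
    for path in shard_paths:
        # Each call only parses one small file instead of the whole corpus.
        parts.append(load_dataset("json", data_files=path, split="train"))
    # Assumption: every shard has the same columns and types,
    # otherwise the concatenation will fail.
    return concatenate_datasets(parts)
```

Usage would be something like `dataset = load_and_merge(list_shards("shards/"))`, where `shards/` is a placeholder directory.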

Have you managed to solve this?

May 04 '23 07:05 yuanzhiyong1999

Not yet, I haven't had time to look into it.

May 04 '23 08:05 yfq512

Thanks.

May 04 '23 08:05 yuanzhiyong1999

I think following https://github.com/ymcui/Chinese-LLaMA-Alpaca/issues/222#issuecomment-1534078562 should work. Give it a try.

May 04 '23 08:05 yfq512

I don't quite understand it yet...

May 04 '23 08:05 yuanzhiyong1999

https://github.com/ymcui/Chinese-LLaMA-Alpaca/blob/8a6d8dfa18b22055c857a97ca8f35f6612e3a66a/scripts/run_clm_pt_with_peft.py#LL493C17-L493C94 is probably the place to start: split the large file into small files, then write a loop that concatenates all the small files.

May 04 '23 08:05 yfq512
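
For the splitting step mentioned above, a minimal sketch using only the standard library is shown below. It assumes the 25 GB file is in JSON Lines format (one JSON object per line), which the thread does not confirm. A single top-level JSON array would need a streaming parser instead. The function name `split_jsonl`, the shard naming scheme, and the default shard size are all illustrative choices.

```python
import os


def split_jsonl(src_path, out_dir, lines_per_shard=100_000):
    """Split a JSON Lines file into numbered shards and return their paths."""
    os.makedirs(out_dir, exist_ok=True)
    shard_paths = []
    out = None
    count = 0
    try:
        with open(src_path, "r", encoding="utf-8") as src:
            # Iterating over the file reads one line at a time,
            # so the whole file never has to fit in memory.
            for line in src:
                if not line.strip():
                    continue  # skip blank lines
                if out is None or count >= lines_per_shard:
                    if out is not None:
                        out.close()
                    name = f"shard_{len(shard_paths):05d}.jsonl"
                    path = os.path.join(out_dir, name)
                    out = open(path, "w", encoding="utf-8")
                    shard_paths.append(path)
                    count = 0
                out.write(line if line.endswith("\n") else line + "\n")
                count += 1
    finally:
        if out is not None:
            out.close()
    return shard_paths
```

The resulting shards can then be loaded one by one and concatenated in a loop, as suggested above.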

I'll give it a try.

May 04 '23 08:05 yuanzhiyong1999

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your consideration.

May 13 '23 00:05 github-actions[bot]

Closing the issue, since no updates observed. Feel free to re-open if you need any further assistance.

May 16 '23 22:05 github-actions[bot]
