data-juicer
Why does this often happen: "One of the subprocesses has abruptly died during map operation"?
Before Asking
- [X] I have read the README carefully.
- [X] I have pulled the latest code of the main branch and run it again, and the problem still exists.
Search before asking
- [X] I have searched the Data-Juicer issues and found no similar questions.
Question
Hello,
- When using dj I hit this error: RuntimeError: One of the subprocesses has abruptly died during map operation. To debug the error, disable multiprocessing. Even with a single process it still crashes, and no warning is printed.
- I also wanted to use use_checkpoint, hoping that if each op's result could be saved, I could finish the job over several runs. But this feature doesn't work for me either; it also fails.
My config:
export_shard_size: 0
export_in_parallel: false
np: 10  # number of subprocesses to process your dataset
open_tracer: true
text_keys: 'text'
use_checkpoint: true
op_fusion: false
cache_compress: 'gzip'
process:
  - language_id_score_filter:
      lang: [en]
      min_score: 0.8
  - whitespace_normalization_mapper:
My data size: 8 million rows, 5.2 GB.
Error message:
2024-09-14 06:45:05 | ERROR | data_juicer.core.data:195 - An error occurred during Op [whitespace_normalization_mapper].
Traceback (most recent call last):
  File "/home/work/wangsicong/code/data-juicer/data_juicer/core/data.py", line 187, in process
    dataset = op.run(dataset, exporter=exporter, tracer=tracer)
  File "/home/work/wangsicong/code/data-juicer/data_juicer/ops/base_op.py", line 240, in run
    new_dataset = dataset.map(
  File "/home/work/wangsicong/code/data-juicer/data_juicer/core/data.py", line 248, in map
    new_ds = NestedDataset(super().map(*args, **kargs))
  File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 593, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 558, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3197, in map
    for rank, done, content in iflatmap_unordered(
  File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 656, in iflatmap_unordered
    raise RuntimeError(
RuntimeError: One of the subprocesses has abruptly died during map operation. To debug the error, disable multiprocessing.
2024-09-14 06:45:05 | INFO | data_juicer.core.data:200 - Writing checkpoint of dataset processed by last op...
Saving the dataset (0/40 shards): 0%| | 0/3427644 [00:00<?, ? examples/s]
Saving the dataset (0/40 shards): 0%| | 1000/3427644 [36:42<2096:14:49, 2.20s/ examples]
Saving the dataset (0/40 shards): 0%| | 1000/3427644 [36:43<2097:19:51, 2.20s/ examples]
2024-09-14 07:23:31 | ERROR | main:33 - An error has been caught in function '
File "/home/work/wangsicong/code/data-juicer/data_juicer/core/data.py", line 187, in process dataset = op.run(dataset, exporter=exporter, tracer=tracer) │ │ │ │ └ <data_juicer.core.tracer.Tracer object at 0x7fe90ff72110> │ │ │ └ <data_juicer.core.exporter.Exporter object at 0x7fe90ff73220> │ │ └ Dataset({ │ │ features: ['text', 'dj__stats'], │ │ num_rows: 3427644 │ │ }) │ └ <function Mapper.run at 0x7feabc2cbd00> └ <data_juicer.ops.mapper.whitespace_normalization_mapper.WhitespaceNormalizationMapper object at 0x7fe90fb80520>
File "/home/work/wangsicong/code/data-juicer/data_juicer/ops/base_op.py", line 240, in run new_dataset = dataset.map( │ └ <function NestedDataset.map at 0x7fe9102265f0> └ Dataset({ features: ['text', 'dj__stats'], num_rows: 3427644 })
File "/home/work/wangsicong/code/data-juicer/data_juicer/core/data.py", line 248, in map new_ds = NestedDataset(super().map(*args, **kargs)) │ │ └ {'num_proc': 40, 'with_rank': False, 'desc': 'whitespace_normalization_mapper_process', 'batched': True, 'batch_size': 1, 'ne... │ └ [<function WhitespaceNormalizationMapper.process at 0x7fe90f9ad900>] └ <class 'data_juicer.core.data.NestedDataset'>
File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 593, in wrapper out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs) │ │ │ │ └ {'num_proc': 40, 'with_rank': False, 'desc': 'whitespace_normalization_mapper_process', 'batched': True, 'batch_size': 1, 'ne... │ │ │ └ (<function WhitespaceNormalizationMapper.process at 0x7fe90f9ad900>,) │ │ └ Dataset({ │ │ features: ['text', 'dj__stats'], │ │ num_rows: 3427644 │ │ }) │ └ <function Dataset.map at 0x7feabc54d1b0> └ typing.Union File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 558, in wrapper out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs) │ │ │ │ └ {'num_proc': 40, 'with_rank': False, 'desc': 'whitespace_normalization_mapper_process', 'batched': True, 'batch_size': 1, 'ne... │ │ │ └ (<function WhitespaceNormalizationMapper.process at 0x7fe90f9ad900>,) │ │ └ Dataset({ │ │ features: ['text', 'dj__stats'], │ │ num_rows: 3427644 │ │ }) │ └ <function Dataset.map at 0x7feabc54d120> └ typing.Union File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3197, in map for rank, done, content in iflatmap_unordered( │ └ <function iflatmap_unordered at 0x7feac62ba8c0> └ 39 File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 656, in iflatmap_unordered raise RuntimeError(
RuntimeError: One of the subprocesses has abruptly died during map operation.To debug the error, disable multiprocessing.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/work/wangsicong/code/data-juicer/data_juicer/core/data.py", line 197, in process exit(1)
File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/_sitebuiltins.py", line 26, in call raise SystemExit(code) └ 1
SystemExit: 1
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/work/wangsicong/miniconda3/envs/data_juicer/bin/dj-process", line 33, in
sys.exit(load_entry_point('py-data-juicer', 'console_scripts', 'dj-process')()) │ │ └ <function importlib_load_entry_point at 0x7fec1ff5bd90> │ └ └ <module 'sys' (built-in)>
File "/home/work/wangsicong/code/data-juicer/tools/process_data.py", line 15, in main executor.run() │ └ <function Executor.run at 0x7fe910227ac0> └ <data_juicer.core.executor.Executor object at 0x7fe90fb80670>
File "/home/work/wangsicong/code/data-juicer/data_juicer/core/executor.py", line 164, in run dataset = dataset.process(ops, │ │ └ [<data_juicer.ops.filter.language_id_score_filter.LanguageIDScoreFilter object at 0x7fe910036f20>, <data_juicer.ops.mapper.wh... │ └ <function NestedDataset.process at 0x7fe910226560> └ Dataset({ features: ['text'], num_rows: 8537246 })
File "/home/work/wangsicong/code/data-juicer/data_juicer/core/data.py", line 203, in process checkpointer.save_ckpt(dataset) │ │ └ Dataset({ │ │ features: ['text', 'dj__stats'], │ │ num_rows: 3427644 │ │ }) │ └ <function CheckpointManager.save_ckpt at 0x7fe9102272e0> └ <data_juicer.utils.ckpt_utils.CheckpointManager object at 0x7fe90fb4cfd0>
File "/home/work/wangsicong/code/data-juicer/data_juicer/utils/ckpt_utils.py", line 124, in save_ckpt ds.save_to_disk(self.ckpt_ds_dir, num_proc=self.num_proc) │ │ │ │ │ └ 40 │ │ │ │ └ <data_juicer.utils.ckpt_utils.CheckpointManager object at 0x7fe90fb4cfd0> │ │ │ └ '/home/work/wangsicong/data/ckpt/latest' │ │ └ <data_juicer.utils.ckpt_utils.CheckpointManager object at 0x7fe90fb4cfd0> │ └ <function Dataset.save_to_disk at 0x7feabc543400> └ Dataset({ features: ['text', 'dj__stats'], num_rows: 3427644 })
File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 1523, in save_to_disk for job_id, done, content in iflatmap_unordered( │ │ │ └ <function iflatmap_unordered at 0x7feac62ba8c0> │ │ └ 1000 │ └ False └ 0 File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 656, in iflatmap_unordered raise RuntimeError(
RuntimeError: One of the subprocesses has abruptly died during map operation.To debug the error, disable multiprocessing. Traceback (most recent call last): File "/home/work/wangsicong/code/data-juicer/data_juicer/core/data.py", line 187, in process dataset = op.run(dataset, exporter=exporter, tracer=tracer) File "/home/work/wangsicong/code/data-juicer/data_juicer/ops/base_op.py", line 240, in run new_dataset = dataset.map( File "/home/work/wangsicong/code/data-juicer/data_juicer/core/data.py", line 248, in map new_ds = NestedDataset(super().map(*args, **kargs)) File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 593, in wrapper out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs) File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 558, in wrapper out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs) File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3197, in map for rank, done, content in iflatmap_unordered( File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 656, in iflatmap_unordered raise RuntimeError( RuntimeError: One of the subprocesses has abruptly died during map operation.To debug the error, disable multiprocessing.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/work/wangsicong/code/data-juicer/data_juicer/core/data.py", line 197, in process
    exit(1)
  File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/_sitebuiltins.py", line 26, in __call__
    raise SystemExit(code)
SystemExit: 1
Additional
No response
This problem comes from Hugging Face's dataset.map. Please check whether the machine is short on resources and try reducing num_proc.
Also, your config sets np=10, yet the error message shows 40. Please check whether you are running an old version of the code; we recommend updating to the latest version.
https://github.com/hiyouga/LLaMA-Factory/issues/662 https://github.com/huggingface/datasets/issues/6787 https://discuss.huggingface.co/t/map-multiprocessing-issue/4085
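As the RuntimeError message itself suggests, disabling multiprocessing lets the real exception surface. Below is a minimal sketch of that idea using plain Hugging Face datasets, not the data-juicer code path; the toy rows and the normalize_whitespace() helper are made-up placeholders.

# Minimal sketch: run a batched map with multiprocessing disabled so the real
# worker exception is raised in the main process instead of the generic
# "One of the subprocesses has abruptly died" error from the worker pool.
from datasets import Dataset

ds = Dataset.from_dict({"text": ["  hello \t world  ", "foo\u00a0bar"]})

def normalize_whitespace(batch):
    # Placeholder batched op: collapse runs of whitespace in each sample.
    batch["text"] = [" ".join(t.split()) for t in batch["text"]]
    return batch

# num_proc=1 (or leaving num_proc unset) keeps the work in the main process,
# so any failure shows its own full traceback.
ds = ds.map(normalize_whitespace, batched=True, num_proc=1)
print(ds["text"])

In a data-juicer config, the corresponding knob would be np, which is forwarded to num_proc as shown in the traceback above.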
Thanks. I actually started with np=40, then guessed it might be too large and changed it to 10, but that didn't work either.
I'll keep reducing it and see. Thanks.
Hi, a follow-up question: I switched to np=4 (already quite small). The first op succeeded, but the second op failed again. Do you have any other suggestions?
language_id_score_filter_process (num_proc=4): 100%|#########9| 75974136/75976181 [11:56<00:00, 40247.35 examples/s]
language_id_score_filter_process (num_proc=4): 100%|##########| 75976181/75976181 [11:58<00:00, 105736.02 examples/s]
2024-09-15 17:01:38 | INFO | data_juicer.core.data:192 - OP [language_id_score_filter] Done in 20398.867s. Left 30544553 samples.
whitespace_normalization_mapper_process (num_proc=4): 0%| | 0/30544553 [00:00<?, ? examples/s]
whitespace_normalization_mapper_process (num_proc=4): 0%| | 0/30544553 [06:58<?, ? examples/s]
2024-09-15 17:31:26 | ERROR | data_juicer.core.data:195 - An error occurred during Op [whitespace_normalization_mapper].
Traceback (most recent call last):
  File "/home/work/wangsicong/code/data-juicer/data_juicer/core/data.py", line 187, in process
    dataset = op.run(dataset, exporter=exporter, tracer=tracer)
  File "/home/work/wangsicong/code/data-juicer/data_juicer/ops/base_op.py", line 240, in run
    new_dataset = dataset.map(
  File "/home/work/wangsicong/code/data-juicer/data_juicer/core/data.py", line 248, in map
    new_ds = NestedDataset(super().map(*args, **kargs))
  File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 593, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 558, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3197, in map
    for rank, done, content in iflatmap_unordered(
  File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 656, in iflatmap_unordered
    raise RuntimeError(
RuntimeError: One of the subprocesses has abruptly died during map operation. To debug the error, disable multiprocessing.
This issue is marked as stale because there has been no activity for 21 days. Remove the stale label or add new comments, or this issue will be closed in 3 days.
Close this stale issue.
May I ask how you eventually solved this problem?