
Why does this often happen: "One of the subprocesses has abruptly died during map operation"?

Open strongcc opened this issue 1 year ago • 3 comments

Before Asking

  • [X] I have read the README carefully.

  • [X] I have pulled the latest code of the main branch and run again, and the problem still exists.

Search before asking

  • [X] I have searched the Data-Juicer issues and found no similar questions.

Question

Hello,

  1. When using data-juicer I hit the error: RuntimeError: One of the subprocesses has abruptly died during map operation.To debug the error, disable multiprocessing. Even when I use a single process it still crashes, and no warning message is printed.
  2. I wanted to use use_checkpoint, figuring that if each op's result were saved, I could finish the job over several runs. But that feature does not work well for me either; it also fails.

My config:

export_shard_size: 0
export_in_parallel: false
np: 10  # number of subprocess to process your dataset
open_tracer: true
text_keys: 'text'

use_checkpoint: true
op_fusion: false
cache_compress: 'gzip'

process:
  - language_id_score_filter:
      lang: [en]
      min_score: 0.8
  - whitespace_normalization_mapper:
My dataset size: 8 million rows, 5.2 GB.

Error message:

2024-09-14 06:45:05 | ERROR | data_juicer.core.data:195 - An error occurred during Op [whitespace_normalization_mapper].
Traceback (most recent call last):
  File "/home/work/wangsicong/code/data-juicer/data_juicer/core/data.py", line 187, in process
    dataset = op.run(dataset, exporter=exporter, tracer=tracer)
  File "/home/work/wangsicong/code/data-juicer/data_juicer/ops/base_op.py", line 240, in run
    new_dataset = dataset.map(
  File "/home/work/wangsicong/code/data-juicer/data_juicer/core/data.py", line 248, in map
    new_ds = NestedDataset(super().map(*args, **kargs))
  File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 593, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 558, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3197, in map
    for rank, done, content in iflatmap_unordered(
  File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 656, in iflatmap_unordered
    raise RuntimeError(
RuntimeError: One of the subprocesses has abruptly died during map operation.To debug the error, disable multiprocessing.
2024-09-14 06:45:05 | INFO | data_juicer.core.data:200 - Writing checkpoint of dataset processed by last op...

Saving the dataset (0/40 shards):   0%| | 0/3427644 [00:00<?, ? examples/s]
Saving the dataset (0/40 shards):   0%| | 1000/3427644 [36:42<2096:14:49, 2.20s/ examples]
Saving the dataset (0/40 shards):   0%| | 1000/3427644 [36:43<2097:19:51, 2.20s/ examples]
2024-09-14 07:23:31 | ERROR | __main__:33 - An error has been caught in function '<module>', process 'MainProcess' (23318), thread 'MainThread' (140652129257280):
Traceback (most recent call last):

  File "/home/work/wangsicong/code/data-juicer/data_juicer/core/data.py", line 187, in process
    dataset = op.run(dataset, exporter=exporter, tracer=tracer)
    [op: WhitespaceNormalizationMapper; dataset: Dataset(features=['text', 'dj__stats'], num_rows=3427644)]

  File "/home/work/wangsicong/code/data-juicer/data_juicer/ops/base_op.py", line 240, in run
    new_dataset = dataset.map(

  File "/home/work/wangsicong/code/data-juicer/data_juicer/core/data.py", line 248, in map
    new_ds = NestedDataset(super().map(*args, **kargs))
    [kwargs: {'num_proc': 40, 'with_rank': False, 'desc': 'whitespace_normalization_mapper_process', 'batched': True, 'batch_size': 1, ...}]

  File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 593, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 558, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3197, in map
    for rank, done, content in iflatmap_unordered(
  File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 656, in iflatmap_unordered
    raise RuntimeError(

RuntimeError: One of the subprocesses has abruptly died during map operation.To debug the error, disable multiprocessing.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

File "/home/work/wangsicong/code/data-juicer/data_juicer/core/data.py", line 197, in process exit(1)

File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/_sitebuiltins.py", line 26, in call raise SystemExit(code) └ 1

SystemExit: 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

File "/home/work/wangsicong/miniconda3/envs/data_juicer/bin/dj-process", line 33, in sys.exit(load_entry_point('py-data-juicer', 'console_scripts', 'dj-process')()) │ │ └ <function importlib_load_entry_point at 0x7fec1ff5bd90> │ └ └ <module 'sys' (built-in)>

File "/home/work/wangsicong/code/data-juicer/tools/process_data.py", line 15, in main executor.run() │ └ <function Executor.run at 0x7fe910227ac0> └ <data_juicer.core.executor.Executor object at 0x7fe90fb80670>

File "/home/work/wangsicong/code/data-juicer/data_juicer/core/executor.py", line 164, in run dataset = dataset.process(ops, │ │ └ [<data_juicer.ops.filter.language_id_score_filter.LanguageIDScoreFilter object at 0x7fe910036f20>, <data_juicer.ops.mapper.wh... │ └ <function NestedDataset.process at 0x7fe910226560> └ Dataset({ features: ['text'], num_rows: 8537246 })

File "/home/work/wangsicong/code/data-juicer/data_juicer/core/data.py", line 203, in process checkpointer.save_ckpt(dataset) │ │ └ Dataset({ │ │ features: ['text', 'dj__stats'], │ │ num_rows: 3427644 │ │ }) │ └ <function CheckpointManager.save_ckpt at 0x7fe9102272e0> └ <data_juicer.utils.ckpt_utils.CheckpointManager object at 0x7fe90fb4cfd0>

File "/home/work/wangsicong/code/data-juicer/data_juicer/utils/ckpt_utils.py", line 124, in save_ckpt ds.save_to_disk(self.ckpt_ds_dir, num_proc=self.num_proc) │ │ │ │ │ └ 40 │ │ │ │ └ <data_juicer.utils.ckpt_utils.CheckpointManager object at 0x7fe90fb4cfd0> │ │ │ └ '/home/work/wangsicong/data/ckpt/latest' │ │ └ <data_juicer.utils.ckpt_utils.CheckpointManager object at 0x7fe90fb4cfd0> │ └ <function Dataset.save_to_disk at 0x7feabc543400> └ Dataset({ features: ['text', 'dj__stats'], num_rows: 3427644 })

File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 1523, in save_to_disk for job_id, done, content in iflatmap_unordered( │ │ │ └ <function iflatmap_unordered at 0x7feac62ba8c0> │ │ └ 1000 │ └ False └ 0 File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 656, in iflatmap_unordered raise RuntimeError(

RuntimeError: One of the subprocesses has abruptly died during map operation.To debug the error, disable multiprocessing. Traceback (most recent call last): File "/home/work/wangsicong/code/data-juicer/data_juicer/core/data.py", line 187, in process dataset = op.run(dataset, exporter=exporter, tracer=tracer) File "/home/work/wangsicong/code/data-juicer/data_juicer/ops/base_op.py", line 240, in run new_dataset = dataset.map( File "/home/work/wangsicong/code/data-juicer/data_juicer/core/data.py", line 248, in map new_ds = NestedDataset(super().map(*args, **kargs)) File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 593, in wrapper out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs) File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 558, in wrapper out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs) File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3197, in map for rank, done, content in iflatmap_unordered( File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 656, in iflatmap_unordered raise RuntimeError( RuntimeError: One of the subprocesses has abruptly died during map operation.To debug the error, disable multiprocessing.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/work/wangsicong/code/data-juicer/data_juicer/core/data.py", line 197, in process
    exit(1)
  File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/_sitebuiltins.py", line 26, in __call__
    raise SystemExit(code)
SystemExit: 1

Additional

No response

strongcc (Sep 14, 2024)

This error is raised by Hugging Face's dataset.map. Please check whether the machine is running short of resources, and try reducing num_proc.

In addition, your config sets np=10, but the error log shows 40. Please check whether you are running an older version of the code; we recommend updating to the latest version to resolve this.
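
A minimal sketch of what "disable multiprocessing to debug" can look like in practice, using plain Hugging Face datasets on a small slice of the data. The file name `data.jsonl` and the `normalize_whitespace` function are illustrative stand-ins, not data-juicer's actual op:

```python
# Run the same kind of batched map in a single process, so the real exception
# (OOM, bad record, encoding issue, ...) surfaces with a full traceback
# instead of the generic "subprocess has abruptly died" wrapper.
from datasets import load_dataset

ds = load_dataset("json", data_files="data.jsonl", split="train")

def normalize_whitespace(batch):
    # Illustrative stand-in for whitespace_normalization_mapper.
    batch["text"] = [" ".join(t.split()) for t in batch["text"]]
    return batch

small = ds.select(range(min(100_000, len(ds))))
out = small.map(
    normalize_whitespace,
    batched=True,
    batch_size=1000,
    num_proc=None,  # single process: no worker subprocesses are spawned
)
```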

https://github.com/hiyouga/LLaMA-Factory/issues/662 https://github.com/huggingface/datasets/issues/6787 https://discuss.huggingface.co/t/map-multiprocessing-issue/4085
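
Not something tried in this thread, just a possible fallback sketch if reducing num_proc is not enough: pre-split the input into smaller shards with Hugging Face datasets and run each shard as a separate data-juicer job, so a worker crash only costs one shard. The file names and shard count below are hypothetical:

```python
# Hypothetical fallback: split the large JSONL into shards and process each
# shard as its own run.
from datasets import load_dataset

ds = load_dataset("json", data_files="raw_data.jsonl", split="train")

num_shards = 8  # illustrative; pick a size the machine handles comfortably
for i in range(num_shards):
    shard = ds.shard(num_shards=num_shards, index=i, contiguous=True)
    shard.to_json(f"shard_{i:02d}.jsonl")  # JSON Lines output by default
```

Each `shard_XX.jsonl` can then be used as the input (e.g. the dataset_path of a separate config) and the exported results merged afterwards.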

drcege (Sep 14, 2024)

Thanks. I started with np=40, then guessed it might be too large and changed it to 10, but that did not work either.

I will keep reducing it and see. Thanks.

strongcc (Sep 15, 2024)


Hello, a follow-up question: I switched to np=4 (already quite small). The first op succeeded, but the second op failed again. Do you have any other suggestions?

language_id_score_filter_process (num_proc=4): 100%|#########9| 75974136/75976181 [11:56<00:00, 40247.35 examples/s]
language_id_score_filter_process (num_proc=4): 100%|##########| 75976181/75976181 [11:58<00:00, 105736.02 examples/s]
2024-09-15 17:01:38 | INFO | data_juicer.core.data:192 - OP [language_id_score_filter] Done in 20398.867s. Left 30544553 samples.

whitespace_normalization_mapper_process (num_proc=4):   0%| | 0/30544553 [00:00<?, ? examples/s]
whitespace_normalization_mapper_process (num_proc=4):   0%| | 0/30544553 [06:58<?, ? examples/s]
2024-09-15 17:31:26 | ERROR | data_juicer.core.data:195 - An error occurred during Op [whitespace_normalization_mapper].
Traceback (most recent call last):
  File "/home/work/wangsicong/code/data-juicer/data_juicer/core/data.py", line 187, in process
    dataset = op.run(dataset, exporter=exporter, tracer=tracer)
  File "/home/work/wangsicong/code/data-juicer/data_juicer/ops/base_op.py", line 240, in run
    new_dataset = dataset.map(
  File "/home/work/wangsicong/code/data-juicer/data_juicer/core/data.py", line 248, in map
    new_ds = NestedDataset(super().map(*args, **kargs))
  File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 593, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 558, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3197, in map
    for rank, done, content in iflatmap_unordered(
  File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 656, in iflatmap_unordered
    raise RuntimeError(
RuntimeError: One of the subprocesses has abruptly died during map operation.To debug the error, disable multiprocessing.

strongcc (Sep 16, 2024)

This issue is marked as stale because there has been no activity for 21 days. Remove the stale label or add new comments, or this issue will be closed in 3 days.

github-actions[bot] (Oct 7, 2024)

Close this stale issue.

github-actions[bot] (Oct 11, 2024)


May I ask how you solved this problem in the end?

myym0 (May 26, 2025)