[Bug]: Using the language_id_score_filter operator in data-juicer on Chinese text fills up the host's memory

Open hxdsdu opened this issue 7 months ago • 0 comments

Before Reporting

[x] I have pulled the latest code of main branch to run again and the bug still existed.
[x] I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend that you can ask a question using the Question template)

Search before reporting

[x] I have searched the Data-Juicer issues and found no similar bugs.

OS

58-Ubuntu SMP Thu Oct 13 08:03:55 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Installation Method

source

Data-Juicer Version

1.3.1

Python Version

3.10.17

Describe the bug

When the language_id_score_filter operator in data-juicer is used to process Chinese text, the job hangs after the data has been processed, and memory usage keeps rising.

To Reproduce

1. Configure the YAML:

    project_name: 'chinese-cleaning'
    work_dir: './outputs/chinese_clean'
    dataset_path: '/mnt/zzb/peixunban/zzb6/data/hxd/data/processed_data/c4_zh'
    export_path: '/mnt/zzb/peixunban/zzb6/data/hxd/data/processed_data/c4_zh_final/c4_zh.parquet'
    export_shard_size: 10737418240
    np: 10
    text_keys: 'text'
    ds_cache_dir: /mnt/zzb/hxd/cache/
    process:
      - language_id_score_filter:  # language identification filter
          lang: 'zh'
          min_score: 0.8

2. Run:

    nohup python tools/process_data.py --config configs/custom/chinese_clean.yaml > process.log 2>&1 &

3. Process a Chinese dataset of about 50 GB. After processing completes, the job hangs and memory keeps rising until the host's memory is completely full.
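To put numbers on the memory growth reported in the reproduction steps, a small sampler like the one below could be run alongside the job. This is only a sketch and not part of Data-Juicer: it assumes a Linux host (it reads `/proc/<pid>/status`), and the function names `parse_vmrss_kib` and `sample_rss` are hypothetical helpers introduced here for illustration.

```python
import time


def parse_vmrss_kib(status_text):
    """Return the VmRSS value in KiB from the text of /proc/<pid>/status, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # Line format: "VmRSS:     123456 kB"
            return int(line.split()[1])
    return None


def sample_rss(pid, interval_s=5.0, samples=3):
    """Print the resident memory of `pid` every `interval_s` seconds (Linux only)."""
    for _ in range(samples):
        with open(f"/proc/{pid}/status") as f:
            rss_kib = parse_vmrss_kib(f.read())
        print(f"{time.strftime('%H:%M:%S')} pid={pid} rss_kib={rss_kib}")
        time.sleep(interval_s)
```

Calling `sample_rss(<pid>)` with the PID of the `process_data.py` process would log its resident set size over time. Since the config sets `np: 10`, the worker processes have their own PIDs and would need to be sampled separately.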

Configs

No response

Logs

process_04201121.log

Screenshots

No response

Additional

No response

Apr 21 '25 02:04 hxdsdu
