
[Bug]: Using the language_id_score_filter operator in data-juicer to process Chinese text fills up host memory

Open hxdsdu opened this issue 7 months ago • 0 comments

Before Reporting

  • [x] I have pulled the latest code of main branch to run again and the bug still existed.

  • [x] I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend that you ask a question using the Question template.)

Search before reporting

  • [x] I have searched the Data-Juicer issues and found no similar bugs.

OS

58-Ubuntu SMP Thu Oct 13 08:03:55 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Installation Method

source

Data-Juicer Version

1.3.1

Python Version

3.10.17

Describe the bug

When the language_id_score_filter operator in data-juicer is used to process Chinese text, the job hangs after the data has been processed and memory usage keeps rising.

To Reproduce

1. Configure the YAML:

    project_name: 'chinese-cleaning'

    work_dir: './outputs/chinese_clean'

    dataset_path: '/mnt/zzb/peixunban/zzb6/data/hxd/data/processed_data/c4_zh'
    export_path: '/mnt/zzb/peixunban/zzb6/data/hxd/data/processed_data/c4_zh_final/c4_zh.parquet'
    export_shard_size: 10737418240
    np: 10
    text_keys: 'text'
    ds_cache_dir: /mnt/zzb/hxd/cache/
    process:
      - language_id_score_filter:  # language identification filter
          lang: 'zh'
          min_score: 0.8

2. Run: nohup python tools/process_data.py --config configs/custom/chinese_clean.yaml > process.log 2>&1 &

3. Process a Chinese dataset of roughly 50 GB. After processing completes, the job hangs and memory keeps rising until host memory is full.
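To make the memory growth in step 3 easier to document, the resident memory of the job can be sampled over time. The following is a minimal sketch, not part of Data-Juicer: the helper names `parse_vm_rss`, `read_rss_kb`, and `sample_rss` are hypothetical, it assumes a Linux host with `/proc` available, and it watches a single PID only (with `np: 10` there may be several worker processes, each of which would need to be sampled separately).

```python
import os
import time
from typing import List, Optional


def parse_vm_rss(status_text: str) -> Optional[int]:
    """Return the VmRSS value in kB from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # Line format: "VmRSS:     123456 kB"
            return int(line.split()[1])
    return None


def read_rss_kb(pid: int) -> Optional[int]:
    """Read the resident set size of a process; None if it cannot be read."""
    try:
        with open(f"/proc/{pid}/status", "r", encoding="utf-8") as f:
            return parse_vm_rss(f.read())
    except OSError:
        # Process is gone, or /proc is not available on this platform.
        return None


def sample_rss(pid: int, interval_s: float = 5.0,
               samples: int = 12) -> List[Optional[int]]:
    """Print and collect `samples` RSS readings taken `interval_s` apart."""
    readings: List[Optional[int]] = []
    for _ in range(samples):
        rss = read_rss_kb(pid)
        readings.append(rss)
        print(f"{time.strftime('%H:%M:%S')} pid={pid} rss_kb={rss}", flush=True)
        time.sleep(interval_s)
    return readings


if __name__ == "__main__":
    # Assumption for illustration: replace os.getpid() with the PID of the
    # running tools/process_data.py job (for example, the value of $! after
    # the nohup command).
    sample_rss(os.getpid(), interval_s=1.0, samples=3)
```

Attaching the resulting readings alongside process.log would show whether memory keeps climbing after the operator has finished.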

Image

Image

Configs

No response

Logs

process_04201121.log

Screenshots

No response

Additional

No response

hxdsdu, Apr 21 '25 02:04