data-juicer
[Bug]: Using the language_id_score_filter operator in data-juicer to process Chinese data fills up the host's memory
Before Reporting

- [x] I have pulled the latest code of main branch to run again and the bug still existed.
- [x] I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend that you ask a question using the Question template.)
Search before reporting

- [x] I have searched the Data-Juicer issues and found no similar bugs.
OS
58-Ubuntu SMP Thu Oct 13 08:03:55 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Installation Method
source
Data-Juicer Version
1.3.1
Python Version
3.10.17
Describe the bug

When I use the language_id_score_filter operator in data-juicer to process Chinese data, the job gets stuck after the data has been processed and memory keeps rising.
To Reproduce

1. Configure the YAML (a sketch of what language_id_score_filter checks per sample follows these steps):

```yaml
project_name: 'chinese-cleaning'
work_dir: './outputs/chinese_clean'
dataset_path: '/mnt/zzb/peixunban/zzb6/data/hxd/data/processed_data/c4_zh'
export_path: '/mnt/zzb/peixunban/zzb6/data/hxd/data/processed_data/c4_zh_final/c4_zh.parquet'
export_shard_size: 10737418240
np: 10
text_keys: 'text'
ds_cache_dir: /mnt/zzb/hxd/cache/

process:
  - language_id_score_filter:    # language identification filter
      lang: 'zh'
      min_score: 0.8
```

2. Run:

```bash
nohup python tools/process_data.py --config configs/custom/chinese_clean.yaml > process.log 2>&1 &
```

3. Process roughly 50 GB of Chinese data. After processing completes, the job freezes and memory keeps rising until the host's memory is full.
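For context on what the filter in step 1 decides per sample: language_id_score_filter keeps a sample only when the detected language matches `lang` with a confidence of at least `min_score`. Below is a minimal sketch of that check using a fastText language-ID model directly; the model path is hypothetical and this is not data-juicer's actual code, only an illustration of the comparison that the config values feed into.

```python
# Minimal sketch of a fastText-based language-ID score check.
# NOTE: this is NOT data-juicer's implementation; the model path below is
# hypothetical and assumes the standard lid.176.bin model was downloaded locally.
import fasttext

model = fasttext.load_model('/path/to/lid.176.bin')  # hypothetical local path

def keep_sample(text: str, lang: str = 'zh', min_score: float = 0.8) -> bool:
    """Return True if fastText identifies `text` as `lang` with confidence >= min_score."""
    # fastText's predict() expects a single line of text, so strip newlines first.
    labels, scores = model.predict(text.replace('\n', ' '))
    pred_lang = labels[0].replace('__label__', '')
    return pred_lang == lang and float(scores[0]) >= min_score

print(keep_sample('这是一段中文文本。'))  # expected: True for confidently Chinese text
```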
Configs
No response
Logs
Screenshots
No response
Additional
No response