data-juicer
Processing appears to hang when using Ray to process a dataset
Before Asking
- [x] I have read the README carefully.
- [x] I have pulled the latest code from the main branch and run it again, and the problem still exists.
Search before asking
- [x] I have searched the Data-Juicer issues and found no similar questions.
Question
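When processing a dataset with Ray, 11783 tasks were created in total. The first 10,000+ tasks finished quickly, but the remaining tasks are extremely slow. The last few in particular have been running for 10 hours without completing, although tasks are still finishing, just very slowly.
Configuration file: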
project_name: 'ray-demo'
dataset_path: 'my-dataset/'
export_path: 'ray.jsonl'
export_shard_size: 1073741824
temp_dir: '/tmp'
text_keys: 'content'
use_cache: true
cache_compress: 'gzip'
open_tracer: false
trace_num: 0
executor_type: 'ray'
ray_address: 'auto'
op_fusion: true
fusion_strategy: 'probe'
# process schedule
# a list of several process operators with their arguments
process:
  - clean_email_mapper:
  - clean_links_mapper:
  - fix_unicode_mapper:
  - whitespace_normalization_mapper:
  - clean_copyright_mapper:
  - maximum_line_length_filter:
      max_len: 1000
  - average_line_length_filter:
      max_len: 100
  - alphanumeric_filter:
      tokenization: False
      min_ratio: 0.25
  - text_length_filter:
      max_len: 96714
  - words_num_filter:
      min_num: 20
      max_num: 6640
  - word_repetition_filter:
      rep_len: 10
      max_ratio: 0.357
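The dataset takes up about 2 TB of storage, is in Parquet format, and the job runs on 512 nodes. Is this behavior normal, and is there a way to locate where it is getting stuck?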
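One possible cause of a slow tail like this is skew in the input files. The sketch below is only a diagnostic under that assumption: it assumes each Ray task roughly corresponds to one input file (not confirmed for this setup), and the helper name `file_size_skew` is hypothetical. It uses only the Python standard library to list the largest Parquet files under the dataset directory relative to the median file size.

```python
from pathlib import Path


def file_size_skew(dataset_dir, suffix=".parquet", top_n=5):
    """Return (median_size, largest_files) for files under dataset_dir.

    largest_files is a list of (size_in_bytes, path) tuples, largest first.
    Hypothetical helper for illustration; not part of Data-Juicer or Ray.
    """
    sizes = sorted(
        ((p.stat().st_size, str(p)) for p in Path(dataset_dir).rglob(f"*{suffix}")),
        reverse=True,
    )
    if not sizes:
        return 0, []
    # Middle element of the descending list is used as the median.
    median = sizes[len(sizes) // 2][0]
    return median, sizes[:top_n]


if __name__ == "__main__":
    # 'my-dataset/' matches dataset_path in the config above.
    median, largest = file_size_skew("my-dataset/")
    print(f"median file size: {median} bytes")
    for size, path in largest:
        ratio = size / median if median else float("inf")
        print(f"{size:>15} bytes  ({ratio:.1f}x median)  {path}")
```

If a handful of files turn out to be many times larger than the median, they would be natural candidates for the slow final tasks. If sizes are uniform, the skew is more likely in the content itself, for example a few very long documents.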
Thanks.
Additional
No response