data-juicer: Processing a dataset with Ray appears to hang

Open cnlinxi opened this issue 3 weeks ago • 0 comments

Before Asking

[x] I have read the README carefully.
[x] I have pulled the latest code of the main branch and run it again, and the problem still exists.

Search before asking

[x] I have searched the Data-Juicer issues and found no similar questions.

Question

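When processing a dataset with Ray, 11,783 tasks were created in total. The first 10,000+ tasks finished quickly, but the remaining tasks are extremely slow. The last few tasks in particular have been running for 10 hours without completing, although tasks do continue to finish slowly.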
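To make the long tail concrete, here is a minimal sketch for flagging straggler tasks from per-task durations. It assumes the durations have already been collected somehow (for example, copied by hand from the Ray dashboard); the `find_stragglers` helper, the `factor` threshold, and the sample values are all hypothetical and not part of Data-Juicer or Ray.

```python
from statistics import median


def find_stragglers(durations, factor=10.0):
    """Return (task_id, seconds) pairs whose duration exceeds factor x median.

    `durations` maps a task id to its running time in seconds.
    The result is sorted with the slowest task first.
    """
    if not durations:
        return []
    threshold = factor * median(durations.values())
    slow = [(tid, sec) for tid, sec in durations.items() if sec > threshold]
    return sorted(slow, key=lambda item: item[1], reverse=True)


# Hypothetical sample: three quick tasks and one that has run for 10 hours.
sample = {"task-0": 12.0, "task-1": 15.0, "task-2": 11.0, "task-3": 36000.0}
stragglers = find_stragglers(sample)
print(stragglers)
```

Comparing against the median rather than the mean keeps a handful of very slow tasks from inflating the baseline they are measured against.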

Configuration file:

project_name: 'ray-demo'
dataset_path: 'my-dataset/'
export_path: 'ray.jsonl'
export_shard_size: 1073741824
temp_dir: '/tmp'
text_keys: 'content'

use_cache: true
cache_compress: 'gzip'

open_tracer: false
trace_num: 0

executor_type: 'ray'
ray_address: 'auto'

op_fusion: true
fusion_strategy: 'probe'

# process schedule
# a list of several process operators with their arguments
process:
  - clean_email_mapper:
  - clean_links_mapper:
  - fix_unicode_mapper:
  - whitespace_normalization_mapper:
  - clean_copyright_mapper:
  - maximum_line_length_filter:
      max_len: 1000
  - average_line_length_filter:
      max_len: 100
  - alphanumeric_filter:
      tokenization: False
      min_ratio: 0.25
  - text_length_filter:
      max_len: 96714
  - words_num_filter:
      min_num: 20
      max_num: 6640
  - word_repetition_filter:
      rep_len: 10
      max_ratio: 0.357

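The dataset takes up about 2 TB of storage, is in Parquet format, and is processed with 512 nodes. Is this behavior normal, and is there a way to pinpoint where it is stuck?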
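One thing that might be worth ruling out is skew in the input itself. The sketch below walks the dataset directory and reports how much of the total size sits in the largest Parquet files. It rests on an unconfirmed assumption that unusually large files (or files with very long documents) could end up in the slow tasks; the helper names `collect_parquet_sizes` and `size_skew` are hypothetical, and only the Python standard library is used.

```python
import os


def collect_parquet_sizes(root):
    """Map each .parquet file under `root` to its size in bytes."""
    sizes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".parquet"):
                path = os.path.join(dirpath, name)
                sizes[path] = os.path.getsize(path)
    return sizes


def size_skew(sizes, top_n=5):
    """Summarize how concentrated the total size is in the largest files."""
    ranked = sorted(sizes.items(), key=lambda item: item[1], reverse=True)
    total = sum(sizes.values())
    largest = ranked[:top_n]
    top_bytes = sum(size for _path, size in largest)
    return {
        "file_count": len(ranked),
        "total_bytes": total,
        "largest": largest,
        # Share of all bytes held by the top_n files (0.0 if there is no data).
        "largest_share": top_bytes / total if total else 0.0,
    }


# 'my-dataset/' is the dataset_path from the config; adjust as needed.
report = size_skew(collect_parquet_sizes("my-dataset/"))
print(report)
```

If a few files dominate `largest_share`, repartitioning the input before running the pipeline may be worth trying; if sizes are even, the skew is more likely in per-document length or in a specific operator.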

Thanks.

Additional

No response

Nov 06 '25 13:11 cnlinxi
