
Processing a dataset with Ray appears to hang

Open cnlinxi opened this issue 3 weeks ago • 0 comments

Before Asking

  • [x] I have read the README carefully.

  • [x] I have pulled the latest code of main branch to run again and the problem still existed.

Search before asking

  • [x] I have searched the Data-Juicer issues and found no similar questions.

Question

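When processing a dataset with Ray, 11,783 tasks are created in total. The first 10,000+ tasks finish quickly, but the remaining tasks are extremely slow. The last few in particular have been running for 10 hours without completing, although tasks are still finishing, just very slowly.

Configuration file: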

project_name: 'ray-demo'
dataset_path: 'my-dataset/'
export_path: 'ray.jsonl'
export_shard_size: 1073741824
temp_dir: '/tmp'
text_keys: 'content'

use_cache: true
cache_compress: 'gzip'

open_tracer: false
trace_num: 0

executor_type: 'ray'
ray_address: 'auto'

op_fusion: true
fusion_strategy: 'probe'

# process schedule
# a list of several process operators with their arguments
process:
  - clean_email_mapper:
  - clean_links_mapper:
  - fix_unicode_mapper:
  - whitespace_normalization_mapper:
  - clean_copyright_mapper:
  - maximum_line_length_filter:
      max_len: 1000
  - average_line_length_filter:
      max_len: 100
  - alphanumeric_filter:
      tokenization: False
      min_ratio: 0.25
  - text_length_filter:
      max_len: 96714
  - words_num_filter:
      min_num: 20
      max_num: 6640
  - word_repetition_filter:
      rep_len: 10
      max_ratio: 0.357

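The dataset takes up about 2 TB of storage, in Parquet format, with 512 nodes. Is this behavior normal, and is there a way to locate where it is stuck?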
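
One possible first check, sketched below, is whether the input files are very uneven in size, since a handful of oversized files could explain a long tail of slow tasks. This is only a guess: it assumes the slow tasks line up with unusually large input files, which has not been confirmed here, and file size is just a rough proxy for per-task work. The helper names `file_sizes` and `skew_report` are hypothetical and not part of Data-Juicer or Ray; the script uses only the Python standard library.

```python
import os
import statistics


def file_sizes(root, suffix=".parquet"):
    """Return (size_in_bytes, path) for every file under root ending in suffix."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(suffix):
                path = os.path.join(dirpath, name)
                sizes.append((os.path.getsize(path), path))
    return sizes


def skew_report(root, suffix=".parquet", top=5):
    """Summarize how uneven the input file sizes are (hypothetical helper)."""
    # Largest files first, so the likely stragglers are at the head of the list.
    sizes = sorted(file_sizes(root, suffix), reverse=True)
    if not sizes:
        return {"count": 0, "median": 0, "largest": [], "max_over_median": None}
    median = statistics.median(size for size, _ in sizes)
    return {
        "count": len(sizes),
        "median": median,
        "largest": sizes[:top],
        # A ratio far above 1 means a few files are much bigger than typical.
        "max_over_median": sizes[0][0] / median if median else None,
    }
```

Calling `skew_report('my-dataset/')` on the dataset directory from the config would list the largest files and the ratio of the largest file to the median. A ratio far above 1 would be a hint, not proof, that a few inputs dominate the runtime.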

Thanks

Additional

No response

cnlinxi Nov 06 '25 13:11