data-juicer: How to configure S3-type data

Open Young-zj opened this issue 1 month ago • 3 comments

Before Asking

[x] I have read the README carefully.
[x] I have pulled the latest code of main branch to run again and the problem still existed.

Search before asking

[x] I have searched the Data-Juicer issues and found no similar questions.

Question

How do I configure an s3:// object storage address as the input and output paths?

Additional

No response

Oct 23 '25 13:10 Young-zj

In your scenario, is the whole dataset stored in S3, or is each multimedia file referenced by the data stored in S3?

Nov 03 '25 20:11 cyruszhang

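> In your scenario, is the whole dataset stored in S3, or is each multimedia file referenced by the data stored in S3?

The whole dataset is stored in S3. The operator configuration looks something like this:

```yaml
project_name: ray-dj-test
dataset_path: s3://mnt/dst/the-pile-philpaper-refine-result.jsonl
export_path: s3://mnt/dst/processed_demo/
np: 2
executor_type: ray
ray_address: auto
process:
  - clean_links_mapper:
```

Also, is it supported to keep the dataset metadata in JSONL while fields in it point to S3 multimedia resources? For example:

```jsonl
{"id": 1, "image_url": "s3://my-bucket/images/img1.jpg"}
{"id": 2, "image_url": "s3://my-bucket/images/img2.jpg"}
```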

Nov 04 '25 02:11 Young-zj

We will add support for S3 datasets. If fields in the JSONL point to S3 multimedia resources, you can use download_file_mapper to download them locally.
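
As a stopgap outside Data-Juicer, the "download locally" idea can be sketched in plain Python. This is not the implementation of download_file_mapper and makes no claim about its parameters. The helper names (`split_s3_uri`, `localize_records`, `download`), the `<field>_s3` backup key, and the local directory layout are all hypothetical, and the download step assumes the third-party boto3 package is installed with credentials configured.

```python
import json
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split 's3://bucket/key' into (bucket, key); reject anything else."""
    parsed = urlparse(uri)
    # A single-slash URI such as 's3:/bucket/key' has an empty netloc.
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not a valid s3:// URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def localize_records(jsonl_text: str, field: str, local_dir: str) -> list[dict]:
    """Rewrite `field` in each JSONL record to a local path.

    The original URI is kept under '<field>_s3' (hypothetical naming)
    so that the download step knows what to fetch.
    """
    records = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        rec = json.loads(line)
        bucket, key = split_s3_uri(rec[field])
        rec[f"{field}_s3"] = rec[field]
        rec[field] = f"{local_dir.rstrip('/')}/{bucket}/{key}"
        records.append(rec)
    return records


def download(records: list[dict], field: str) -> None:
    """Fetch every referenced object to its local path using boto3."""
    import os

    import boto3  # assumption: installed separately; imported lazily

    s3 = boto3.client("s3")
    for rec in records:
        bucket, key = split_s3_uri(rec[f"{field}_s3"])
        os.makedirs(os.path.dirname(rec[field]), exist_ok=True)
        s3.download_file(bucket, key, rec[field])
```

Calling `localize_records(text, "image_url", "/tmp/dj")` followed by `download(records, "image_url")` would leave a JSONL-ready list of records whose `image_url` values are local paths, which can then be written out and used as an ordinary local dataset.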

Nov 04 '25 17:11 cyruszhang
