data-juicer

A one-stop data processing system to make data higher-quality, juicier, and more digestible for (multimodal) LLMs! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷

Results 117 data-juicer issues

For example, `{"text":"包含同音字替换测试:今天天气很好,我们去公园玩."}` works, but `{"info":"文本", "text":"包含同音字替换测试:今天天气很好,我们去公园玩."}` does not. It raises an error during use: ``` 2025-11-11 05:33:10.943 | ERROR | data_juicer.core.data.dj_dataset:317 - An error occurred during Op [nlpcda_zh_mapper]. Traceback (most recent call last): File "/data-juicer/data_juicer/core/data/dj_dataset.py", line 297,...

### Before Reporting - [x] I have pulled the latest code of main branch to run again and the bug still exists. - [x] I have read the...

bug

Per request https://github.com/modelscope/data-juicer/issues/799, this supports: - `dataset_path: s3://mnt/dst/the-pile-philpaper-refine-result.jsonl` - `.env` file or environment variables for AWS credentials - Ray or default mode - added sample config - S3 exporting
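To make the `s3://` dataset path and environment-based credentials concrete, here is a minimal sketch using only the Python standard library. It is not data-juicer's implementation: the helper names `split_s3_path` and `aws_credentials_from_env` are hypothetical, and the only assumptions are the standard AWS environment variable names and ordinary `s3://bucket/key` URL semantics.

```python
import os
from typing import Dict, Mapping, Optional, Tuple
from urllib.parse import urlparse


def split_s3_path(dataset_path: str) -> Tuple[str, str]:
    """Split an s3://bucket/key URL into (bucket, key). Hypothetical helper."""
    parsed = urlparse(dataset_path)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3:// path: {dataset_path!r}")
    # netloc is the bucket; the path (minus its leading slash) is the object key.
    return parsed.netloc, parsed.path.lstrip("/")


def aws_credentials_from_env(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Collect whichever standard AWS credential variables are set. Hypothetical helper."""
    env = os.environ if env is None else env
    names = (
        "AWS_ACCESS_KEY_ID",
        "AWS_SECRET_ACCESS_KEY",
        "AWS_SESSION_TOKEN",
        "AWS_DEFAULT_REGION",
    )
    return {name: env[name] for name in names if name in env}


bucket, key = split_s3_path("s3://mnt/dst/the-pile-philpaper-refine-result.jsonl")
credentials = aws_credentials_from_env()
```

A `.env` file would need to be loaded into the process environment first (for example by a dotenv loader) before `aws_credentials_from_env` could see its values.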

`video_whole_body_pose_estimation_mapper`: Input a video containing people, and use the **DWPose** model to extract the body, hand, feet, and face keypoints of the human subjects in the video, i.e., 2D Whole-body...

enhancement
dj:multimodal
dj:op

### Search before continuing - [x] I have searched the Data-Juicer issues and found no similar feature requests. ### Description where is the...

enhancement

### Before Asking - [x] I have read the [README](https://github.com/alibaba/data-juicer/blob/main/README.md) ([README_ZH](https://github.com/alibaba/data-juicer/blob/main/README_ZH.md)) carefully. - [x] I have pulled the latest code of main branch to run again and...

question

Add two video analysis ops, including: - `video_object_segmenting_mapper`: Performs text-guided semantic segmentation of valid objects throughout the video (using YOLOE and SAM2), with support for saving segmentation visualization results. -...

enhancement
dj:multimodal
dj:op

### Before Asking - [x] I have read the [README](https://github.com/alibaba/data-juicer/blob/main/README.md) ([README_ZH](https://github.com/alibaba/data-juicer/blob/main/README_ZH.md)) carefully. - [x] I have pulled the latest code of main branch to run again and...

question

### Before Asking - [x] I have read the [README](https://github.com/alibaba/data-juicer/blob/main/README.md) ([README_ZH](https://github.com/alibaba/data-juicer/blob/main/README_ZH.md)) carefully. - [x] I have pulled the latest code of main branch to run again and...

question