data-juicer
data-juicer copied to clipboard
A one-stop data processing system to make data higher-quality, juicier, and more digestible for (multimodal) LLMs! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷为大模型提供更高质量、更丰富、更易”消化“的数据!
### Before Asking 在提问之前 - [X] I have read the [README](https://github.com/alibaba/data-juicer/blob/main/README.md) carefully. 我已经仔细阅读了 [README](https://github.com/alibaba/data-juicer/blob/main/README_ZH.md) 上的操作指引。 - [X] I have pulled the latest code of main branch to run again and...
1. add data executor hook 2. add data partition executor
### Before Asking 在提问之前 - [X] I have read the [README](https://github.com/alibaba/data-juicer/blob/main/README.md) carefully. 我已经仔细阅读了 [README](https://github.com/alibaba/data-juicer/blob/main/README_ZH.md) 上的操作指引。 - [X] I have pulled the latest code of main branch to run again and...
Add video_face_ratio_filters
Switch face detections to `opencv` implementation. Remove dependency on `dlib`. **Note:** All opencv-related OPs cannot run in multiprocessing with `fork` mode, which should be considered during refactoring.
Add source tags to the newly generated images, videos, and audio in some mapper op.
### Before Asking 在提问之前 - [X] I have read the [README](https://github.com/alibaba/data-juicer/blob/main/README.md) carefully. 我已经仔细阅读了 [README](https://github.com/alibaba/data-juicer/blob/main/README_ZH.md) 上的操作指引。 - [X] I have pulled the latest code of main branch to run again and...
我想自定义一个算子,这个算子可以同时处理三个字段,text_key="instruct","input","output"的方法貌似不成功
the modelscope newestlib unable to download it any more. Alos, is there a simple way to download the videos?
### Before Reporting 报告之前 - [X] I have pulled the latest code of main branch to run again and the bug still existed. 我已经拉取了主分支上最新的代码,重新运行之后,问题仍不能解决。 - [X] I have read the...