data-juicer

A one-stop data processing system to make data higher-quality, juicier, and more digestible for (multimodal) LLMs! 🍎 🍋 🌽 ➡️ 🍸 🍹 🍷

Results 117 data-juicer issues

### Before Reporting - [X] I have pulled the latest code of the main branch and run it again, and the bug still exists. - [X] I have read the...

bug

Update KDD tutorials to the latest version of Data-Juicer. And merge them into the main branch if it's OK. Refer: ### Discussed in https://github.com/modelscope/data-juicer/discussions/475 Originally posted by **Tendo33** November 6,...

bug
documentation

### Search before continuing - [X] I have searched the Data-Juicer issues and found no similar feature requests. ### Description Currently, some LLM-dependent...

enhancement

### Search before continuing - [X] I have searched the Data-Juicer issues and found no similar feature requests. ### Description Enhance the installation...

enhancement

### Search before continuing - [X] I have searched the Data-Juicer issues and found no similar feature requests. ### Description To address potential...

enhancement

### Search before continuing - [X] I have searched the Data-Juicer issues and found no similar feature requests. ### Description Currently, users may...

enhancement

As the title says.
* remove sandbox-related code and configs
* remove deps
* update docs
* move hpo and quality_classifier tools into the internal tools

documentation
enhancement
dj:core

- `video_hand_reconstruction_mapper:` Use the WiLoR model for hand localization and reconstruction.

enhancement
dj:multimodal
dj:op

### Before Asking - [x] I have read the [README](https://github.com/alibaba/data-juicer/blob/main/README.md) ([Chinese version](https://github.com/alibaba/data-juicer/blob/main/README_ZH.md)) carefully. - [x] I have pulled the latest code of main branch to run again and...

question

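design doc (internal): https://aliyuque.antfin.com/ah7ri9/zdesop/qw3tm08a5wcqx446

Overview

The Data-Juicer partitioning, checkpointing, and event logging system provides a comprehensive solution for processing large datasets, with fault tolerance, scalability, and full observability.

Motivation

Ray offers some fault tolerance (the actor persistence mechanism and task-level retry logic), and Ray-DLC will also provide better [fault tolerance and self-healing](https://aliyuque.antfin.com/pai-innovative-algo/reep8g/tpookoiaxd4m80cg?singleDoc#). Some systemic problems remain, however:
● Whole-job execution: Ray processes the entire dataset as a single unit; if a small part fails, the whole OP stage, or even the whole pipeline, fails.
● No progress recovery: because the entire flow runs as one unit, a failure in one part requires rerunning everything.
● No user-configurable, fine-grained fault tolerance, so flexibility is lacking.
● Data persistence and mapping: currently missing. Actors could provide an entry point, but the DJ framework does not support this yet.
● Insufficient observability: Ray only exposes cluster state, so the state of DJ jobs is still hard to observe.

We therefore want to solve all of these problems with a complete set of partitioning, checkpointing, and event logging logic.

Main features
● Fault tolerance: automatic recovery from failures using checkpoints
● Scalability: partition-based processing, suitable for datasets of any size
● Observability:...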
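
The partition, checkpoint, and event-log idea summarized in the design note above can be illustrated with a small sketch. This is not Data-Juicer's implementation or API: the function name `process_in_partitions`, the file layout (`partition_00000.json`, `events.jsonl`), and the event names are all hypothetical, chosen only to show how per-partition checkpoints let a rerun skip finished work while an append-only log records what happened.

```python
import json
from pathlib import Path


def process_in_partitions(samples, op, workdir, partition_size=2):
    """Apply `op` partition by partition, checkpointing each finished partition.

    Hypothetical helper for illustration only; not part of Data-Juicer.
    """
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    event_log = workdir / "events.jsonl"
    results = []
    for start in range(0, len(samples), partition_size):
        idx = start // partition_size
        ckpt = workdir / f"partition_{idx:05d}.json"
        if ckpt.exists():
            # Checkpoint found: reuse the stored output instead of recomputing.
            output = json.loads(ckpt.read_text())
            event = "partition_skipped"
        else:
            # If `op` raises here, earlier checkpoints stay on disk for the next run.
            output = [op(s) for s in samples[start:start + partition_size]]
            # Write to a temp file, then rename, so a crash never leaves
            # a half-written checkpoint behind.
            tmp = ckpt.with_suffix(".tmp")
            tmp.write_text(json.dumps(output))
            tmp.replace(ckpt)
            event = "partition_done"
        # Append-only event log: one JSON record per handled partition.
        with event_log.open("a") as f:
            f.write(json.dumps({"event": event, "partition": idx}) + "\n")
        results.extend(output)
    return results
```

Calling the function a second time with the same `workdir` after a failure recomputes only the partitions that have no checkpoint file, which is the recovery behavior the note asks for. A real system would also need to handle changed inputs or operator configs, which this sketch ignores.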