data-juicer

A one-stop data processing system to make data higher-quality, juicier, and more digestible for (multimodal) LLMs! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷

Results 117 data-juicer issues

After removing two lines of code in PR #597, there is an issue for Sandbox that could not find `work_dir` in later steps. It's hard to resolve this issue by...

bug
dj:core

Comprehensive work for a deduplication benchmark:
- dj recipes for baseline datasets
- tunable toolkit for synthetic duplicate data generation
- more algorithms for deduplication
- boilerplate to run the tests

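Dear Data-Juicer developers, we recently needed to process data for large language models and came across the open-source Data-Juicer framework through the paper "Data-Juicer: A One-Stop Data Processing System for Large Language Models". We would like to use and explore the framework further. We noticed that you ran the "FT-Data Ranker: LLM Fine-Tuning Data Competition (7B model track)" on Tianchi, but the competition has ended and the original data can no longer be downloaded. Could you provide the original data so that we can explore and use the Data-Juicer framework with it? Many thanks 🙏.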

question

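### Search before continuing - [x] I have searched the Data-Juicer issues and found no similar feature requests. ### Description We would like a new operator to be developed: one that, combined with AI-based semantic understanding, masks sensitive PII such as bank passbook account numbers, securities account numbers, e-wallet IDs (Alipay/WeChat Pay accounts), national ID numbers, passport numbers, driver's license numbers, mobile phone numbers, and home addresses (down to the house number). ###...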

enhancement
good first issue
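
To make the PII-masking request above more concrete, here is a minimal sketch of the simplest possible baseline: masking two of the listed fields (mainland China mobile numbers and 18-character resident ID numbers) with regular expressions. This is not Data-Juicer's API and not the AI-based semantic approach the issue asks for; the names `mask_pii`, `MOBILE_RE`, and `ID_RE` are hypothetical, and the patterns only check the shape of a number, not its validity.

```python
import re

# Hypothetical patterns for illustration; not part of Data-Juicer.
# Mainland China mobile numbers: 11 digits, first digit 1, second digit 3-9.
MOBILE_RE = re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)")
# Resident ID numbers: 17 digits followed by a check character (digit or X).
ID_RE = re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)")


def mask_pii(text: str, placeholder: str = "*") -> str:
    """Replace each matched ID or mobile number with a same-length run of placeholders."""

    def _mask(match: re.Match) -> str:
        return placeholder * len(match.group(0))

    # The digit lookarounds keep either pattern from matching inside a longer digit run.
    text = ID_RE.sub(_mask, text)
    return MOBILE_RE.sub(_mask, text)
```

A shape-based filter like this cannot handle free-form fields such as home addresses, which is presumably why the request pairs masking with semantic understanding.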

### Question Environment: Windows 10, Anaconda virtual environment, Python 3.10.16, installed from source. After initialization, running `python tools/process_data.py --config configs/demo/process.yaml` fails with this error: ![Image](https://github.com/user-attachments/assets/e7d6a96b-cf46-4864-8a40-9fd395888051) ### Additional _No response_

question

### Before Asking - [x] I have read the [README](https://github.com/alibaba/data-juicer/blob/main/README.md) ([Chinese version](https://github.com/alibaba/data-juicer/blob/main/README_ZH.md)) carefully. - [x] I have pulled the latest code of main branch to run again and...

question

### Before Asking - [x] I have read the [README](https://github.com/alibaba/data-juicer/blob/main/README.md) ([Chinese version](https://github.com/alibaba/data-juicer/blob/main/README_ZH.md)) carefully. - [x] I have pulled the latest code of main branch to run again and...

question

### Before Asking - [x] I have read the [README](https://github.com/alibaba/data-juicer/blob/main/README.md) ([Chinese version](https://github.com/alibaba/data-juicer/blob/main/README_ZH.md)) carefully. - [x] I have pulled the latest code of main branch to run again and...

question
dj:efficiency

Environment: Python 3.10, Ubuntu 20.04. The error is as follows: ![Image](https://github.com/user-attachments/assets/eb005c18-4ad8-4355-acd8-e3115807f373)

question

### Before Asking - [x] I have read the [README](https://github.com/alibaba/data-juicer/blob/main/README.md) ([Chinese version](https://github.com/alibaba/data-juicer/blob/main/README_ZH.md)) carefully. - [x] I have pulled the latest code of main branch to run again and...

question