data-juicer
A one-stop data processing system to make data higher-quality, juicier, and more digestible for (multimodal) LLMs! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
After two lines of code were removed in PR #597, Sandbox can no longer find `work_dir` in later steps. It's hard to resolve this issue by...
Comprehensive work for a deduplication benchmark:
- DJ recipes for baseline datasets
- a tunable toolkit for synthetic duplicate data generation
- more algorithms for deduplication
- boilerplate to run the tests
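Dear Data-Juicer developers, we recently needed to process data for large language models. Through the paper "Data-Juicer: A One-Stop Data Processing System for Large Language Models" we learned about Data-Juicer, the open-source data processing framework for LLMs, and we would like to use and explore it further. We also noticed that you ran the "FT-Data Ranker: LLM Fine-Tuning Data Competition (7B Model Track)" on Tianchi. However, the competition has ended and the original data can no longer be downloaded. Could you provide the original data so that we can explore and use the Data-Juicer framework? Many thanks 🙏.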
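### Search before continuing
- [x] I have searched the Data-Juicer issues and found no similar feature requests.

### Description
We would like a new operator to be developed. Combined with AI-based semantic understanding, it would mask sensitive PII such as bank passbook account numbers, securities account numbers, e-wallet IDs (Alipay/WeChat Pay accounts), ID card numbers, passport numbers, driver's license numbers, mobile phone numbers, and home addresses (down to the house number).

###...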
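The masking behavior requested above can be illustrated with a minimal, rule-based sketch. This is not a Data-Juicer operator and does not use the Data-Juicer API; the function name `mask_pii`, the placeholder labels, and the two regular expressions (18-character mainland China resident ID numbers and 11-digit mainland China mobile numbers) are assumptions chosen for illustration. The request also asks for semantic understanding, which a regex pass alone does not provide.

```python
import re

# Hypothetical patterns for illustration only; real formats vary and
# addresses or account numbers would need model-based detection.
PII_PATTERNS = {
    # 17 digits followed by a digit or X, not embedded in a longer digit run
    "ID_CARD": re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)"),
    # 11 digits starting with 1 and a second digit in 3-9
    "PHONE": re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)"),
}


def mask_pii(text: str) -> str:
    """Replace each matched span with a bracketed placeholder label."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


if __name__ == "__main__":
    print(mask_pii("Call 13812345678 or check ID 11010519491231002X."))
```

Replacing matches with a typed placeholder rather than deleting them keeps the sentence structure intact, which tends to matter when the masked text is later used for training.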
### Question
Environment: Windows 10, Anaconda virtual environment, Python 3.10.16. Installed from source; after initialization completes, running `python tools/process_data.py --config configs/demo/process.yaml` fails with an error:

### Additional
_No response_
### Before Asking
- [x] I have read the [README](https://github.com/alibaba/data-juicer/blob/main/README.md) carefully (Chinese version: [README_ZH](https://github.com/alibaba/data-juicer/blob/main/README_ZH.md)).
- [x] I have pulled the latest code of main branch to run again and...
Environment: Python 3.10, Ubuntu 20.04. The error is as follows: