data-juicer
A one-stop data processing system to make data higher-quality, juicier, and more digestible for (multimodal) LLMs! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
Would it be more appropriate to change the return type of the read, read_json, and read_webdataset functions in the RayDataSet class from RayDataSet to ray.data.Dataset? Because the data returned by...
### Before Reporting - [x] I have pulled the latest code of the main branch and run it again, and the bug still exists. - [x] I have read the...
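Hello, I have recently been using data-juicer. Our company network is isolated and cannot reach the public internet. Could the models that data-juicer uses be collected in a single folder, so they can be downloaded in one pass and uploaded to our internal storage? I would be very grateful. Also, if I use the Docker image, do these models still need to be downloaded?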
### Before Reporting - [x] I have pulled the latest code of the main branch and run it again, and the bug still exists. - [x] I have read the...
This PR introduces significant performance optimizations for parallel data processing using Ray Actors and multi-threading. The following changes have been implemented: 1. **Dynamic Resource Allocation**: - Multiple Actors per operator...
### Before Asking - [x] I have read the [README](https://github.com/alibaba/data-juicer/blob/main/README.md) ([Chinese version](https://github.com/alibaba/data-juicer/blob/main/README_ZH.md)) carefully. - [x] I have pulled the latest code of the main branch to run again and...
### Search before continuing - [x] I have searched the Data-Juicer issues and found no similar feature requests. ### Description Hi! This is...