data-juicer

A one-stop data processing system to make data higher-quality, juicier, and more digestible for (multimodal) LLMs! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷

Results 117 data-juicer issues

### Before Asking - [x] I have read the [README](https://github.com/alibaba/data-juicer/blob/main/README.md) ([Chinese version](https://github.com/alibaba/data-juicer/blob/main/README_ZH.md)) carefully. - [x] I have pulled the latest code of main branch to run again and...

question

### Before Asking - [x] I have read the [README](https://github.com/alibaba/data-juicer/blob/main/README.md) ([Chinese version](https://github.com/alibaba/data-juicer/blob/main/README_ZH.md)) carefully. - [x] I have pulled the latest code of main branch to run again and...

question

As the title says, this op calculates the in-group diversity for a batch of samples. Here's the breakdown: - It first converts all input samples into embedding vectors. - Then,...

Introduces a novel data selection op based on semantic diversity across domains, designed to automatically select the most diverse subset of data samples, which is inspired by the [DaaR paper](https://arxiv.org/abs/2502.04380)....

Built upon the mmdet3d [Inferencer](https://github.com/open-mmlab/mmdetection3d/blob/main/mmdet3d/apis/inferencers/lidar_seg3d_inferencer.py).

TypeError: 'NoneType' object is not callable
Traceback:
File "D:\python_code\pdf_process\data-juicer\demos\process_cft_zh_data\app.py", line 231, in <module>
  main()
File "D:\python_code\pdf_process\data-juicer\demos\process_cft_zh_data\app.py", line 227, in main
  Visualize.visualize()
File "D:\python_code\pdf_process\data-juicer\demos\process_cft_zh_data\app.py", line 223, in visualize
  Visualize.analyze_process()
File "D:\python_code\pdf_process\data-juicer\demos\process_cft_zh_data\app.py", line...

bug

### Before Asking - [x] I have read the [README](https://github.com/alibaba/data-juicer/blob/main/README.md) ([Chinese version](https://github.com/alibaba/data-juicer/blob/main/README_ZH.md)) carefully. - [x] I have pulled the latest code of main branch to run again and...

question

### Before Asking - [x] I have read the [README](https://github.com/alibaba/data-juicer/blob/main/README.md) ([Chinese version](https://github.com/alibaba/data-juicer/blob/main/README_ZH.md)) carefully. - [x] I have pulled the latest code of main branch to run again and...

question

### Before Asking - [x] I have read the [README](https://github.com/alibaba/data-juicer/blob/main/README.md) ([Chinese version](https://github.com/alibaba/data-juicer/blob/main/README_ZH.md)) carefully. - [x] I have pulled the latest code of main branch to run again and...

question