How should I prepare my own data before using data-juicer operators such as image_caption_mapper?

Before Asking

[x] I have read the README carefully.
[x] I have pulled the latest code of the main branch and run it again, and the problem still exists.

Search before asking

[x] I have searched the Data-Juicer issues and found no similar questions.

Question

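I only have a few images. How do I turn them into a valid input format? Or can I simply set dataset_path in process.yaml to the folder that contains the images, or to the path of a single image? I have read the description of the DJ data format in fmt_conversion/multimodal/, but I am still not sure how to organize these input images.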

Additional

No response

Feb 28 '25 06:02 Crazy-JY

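Hi @Crazy-JY, thanks for your interest in Data-Juicer and for using it!

In short, each sample in your dataset needs to be organized in the format described here.

In your case, if you only need to run image_captioning_mapper on a few existing images, you need a dataset file in addition to the images. Taking the jsonl format as an example, you would create a dataset.jsonl file for these images, in which the sample for each image can be as simple as: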

{
  "text": "<__dj__image>",
  "images": ["/path/to/img1"]
}

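Since the images have no captions yet, the text field contains only the image special token as a placeholder, indicating that the sample contains one image. The images field is a list that holds the path of the image belonging to the sample.

This dataset can be generated with the following snippet: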

import os
import jsonlines
from data_juicer.utils.mm_utils import SpecialTokens

image_dir = 'data'  # directory that contains the images
dataset_file = 'dataset.jsonl'  # path of the dataset file to write

with jsonlines.open(dataset_file, 'w') as writer:
    for fn in os.listdir(image_dir):
        writer.write({
            'text': SpecialTokens.image,  # only the image special token
            'images': [os.path.join(image_dir, fn)],  # image path wrapped in a list
        })

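Once the dataset.jsonl file is generated, set dataset_path in your Data-Juicer config file to it, then run the operators you need.

Give it a try, and feel free to ask if you have any other questions.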
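Before pointing dataset_path at the file, a quick sanity check can catch malformed samples early. The following is a minimal sketch using only the standard library, not part of Data-Juicer: `check_dataset` is a hypothetical helper, and it assumes one `<__dj__image>` placeholder per image path, as in the sample above.

```python
import json
import os

IMAGE_TOKEN = "<__dj__image>"  # same placeholder as in the sample above


def check_dataset(jsonl_path, check_files=True):
    """Return a list of (line_number, problem) tuples; empty if nothing was found."""
    problems = []
    with open(jsonl_path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                sample = json.loads(line)
            except json.JSONDecodeError as e:
                problems.append((lineno, f"invalid JSON: {e}"))
                continue
            if not isinstance(sample, dict):
                problems.append((lineno, "sample is not a JSON object"))
                continue
            text = sample.get("text", "")
            images = sample.get("images", [])
            # Assumption: one placeholder token per image path.
            if text.count(IMAGE_TOKEN) != len(images):
                problems.append((lineno, "token count != number of images"))
            if check_files:
                for path in images:
                    if not os.path.isfile(path):
                        problems.append((lineno, f"missing file: {path}"))
    return problems
```

Calling `check_dataset('dataset.jsonl')` and printing the returned tuples shows which lines need fixing; pass `check_files=False` if the images live on another machine.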

Feb 28 '25 07:02 HYLcool

Thanks a lot! I'll give it a try.

Feb 28 '25 07:02 Crazy-JY

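Hello! Thanks a lot, that solved the data format problem. However, I ran into a new problem when running the image_captioning_mapper operator with a local InternVL2_5-2B model. The error roughly says that neither text nor text_target was specified. The run log and the error message are as follows: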

2025-02-28 08:38:07 | INFO | data_juicer.core.executor:52 - Using cache compression method: [None]
2025-02-28 08:38:07 | INFO | data_juicer.core.executor:57 - Setting up data formatter...
2025-02-28 08:38:07 | INFO | data_juicer.core.executor:80 - Preparing exporter...
2025-02-28 08:38:07 | INFO | data_juicer.core.executor:160 - Loading dataset from data formatter...
2025-02-28 08:38:08 | INFO | data_juicer.format.formatter:185 - Unifying the input dataset formats...
2025-02-28 08:38:08 | INFO | data_juicer.format.formatter:200 - There are 1 sample(s) in the original dataset.
num_proc must be <= 1. Reducing num_proc to 1 for dataset of size 1.
WARNING:datasets.arrow_dataset:num_proc must be <= 1. Reducing num_proc to 1 for dataset of size 1.
2025-02-28 08:38:08 | INFO | data_juicer.format.formatter:214 - 1 samples left after filtering empty text.
2025-02-28 08:38:08 | INFO | data_juicer.format.formatter:237 - Converting relative paths in the dataset to their absolute version. (Based on the directory of input dataset file)
num_proc must be <= 1. Reducing num_proc to 1 for dataset of size 1.
WARNING:datasets.arrow_dataset:num_proc must be <= 1. Reducing num_proc to 1 for dataset of size 1.
2025-02-28 08:38:08 | INFO | data_juicer.format.mixture_formatter:137 - sampled 1 from 1
2025-02-28 08:38:08 | INFO | data_juicer.format.mixture_formatter:143 - There are 1 in final dataset
2025-02-28 08:38:08 | INFO | data_juicer.core.executor:166 - Preparing process operators...
2025-02-28 08:38:08 | INFO | data_juicer.core.executor:194 - Processing data...
2025-02-28 08:38:08 | WARNING | data_juicer.utils.process_utils:75 - The required cuda memory:20.0GB might be more than the available cuda memory:18.77734375GB.This Op[image_captioning_mapper] might require more resource to run.
num_proc must be <= 1. Reducing num_proc to 1 for dataset of size 1.
WARNING:datasets.arrow_dataset:num_proc must be <= 1. Reducing num_proc to 1 for dataset of size 1.
image_captioning_mapper_process: 0%| | 0/1 [00:00<?, ? examples/s]
INFO:transformers_modules.InternVL2_5-2B.configuration_internvl_chat:vision_select_layer: -1
INFO:transformers_modules.InternVL2_5-2B.configuration_internvl_chat:ps_version: v2
INFO:transformers_modules.InternVL2_5-2B.configuration_internvl_chat:min_dynamic_patch: 1
INFO:transformers_modules.InternVL2_5-2B.configuration_internvl_chat:max_dynamic_patch: 12
2025-02-28 08:38:14 | INFO | logging:968 - Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO:transformers_modules.InternVL2_5-2B.configuration_internvl_chat:vision_select_layer: -1
INFO:transformers_modules.InternVL2_5-2B.configuration_internvl_chat:ps_version: v2
INFO:transformers_modules.InternVL2_5-2B.configuration_internvl_chat:min_dynamic_patch: 1
INFO:transformers_modules.InternVL2_5-2B.configuration_internvl_chat:max_dynamic_patch: 12
INFO:transformers_modules.InternVL2_5-2B.configuration_internvl_chat:vision_select_layer: -1
INFO:transformers_modules.InternVL2_5-2B.configuration_internvl_chat:ps_version: v2
INFO:transformers_modules.InternVL2_5-2B.configuration_internvl_chat:min_dynamic_patch: 1
INFO:transformers_modules.InternVL2_5-2B.configuration_internvl_chat:max_dynamic_patch: 12
FlashAttention2 is not installed.
INFO:transformers_modules.InternVL2_5-2B.modeling_internvl_chat:num_image_token: 256
INFO:transformers_modules.InternVL2_5-2B.modeling_internvl_chat:ps_version: v2
Warning: Flash attention is not available, using eager attention instead.
2025-02-28 08:39:18 | ERROR | data_juicer.ops.base_op:67 - An error occurred in image_captioning_mapper when processing samples "{'text': ['<__dj__image>'], 'images': [['/home/dj_test_images/7060.png']]}" -- <class 'ValueError'>: You need to specify either text or text_target.
image_captioning_mapper_process: 100%|##########| 1/1 [01:09<00:00, 69.49s/ examples]
2025-02-28 08:39:18 | INFO | data_juicer.core.data:226 - [1/1] OP [image_captioning_mapper] Done in 69.696s. Left 0 samples.
2025-02-28 08:39:20 | INFO | data_juicer.utils.logger_utils:227 - Processing finished with: Warnings: 1 Errors: 1
╒═════════════════════════╤══════════════════════╤═════════════════════════════════════════════════════╤═══════════════╕
│ OP/Method               │ Error Type           │ Error Message                                       │ Error Count   │
╞═════════════════════════╪══════════════════════╪═════════════════════════════════════════════════════╪═══════════════╡
│ image_captioning_mapper │ <class 'ValueError'> │ You need to specify either text or text_target.     │             1 │
╘═════════════════════════╧══════════════════════╧═════════════════════════════════════════════════════╧═══════════════╛
Error/Warning details can be found in the log file [/data-juicer/outputs/demo-process/log/export_demo-processed.jsonl_time_20250228083755.txt] and its related log files.
2025-02-28 08:39:20 | INFO | data_juicer.core.executor:206 - All OPs are done in 71.412s.
2025-02-28 08:39:20 | INFO | data_juicer.core.executor:209 - Exporting dataset to disk...
2025-02-28 08:39:20 | INFO | data_juicer.core.exporter:111 - Exporting computed stats into a single file...
2025-02-28 08:39:20 | INFO | data_juicer.core.exporter:146 - Export dataset into a single file...
Creating json from Arrow format: 0ba [00:00, ?ba/s]

Feb 28 '25 08:02 Crazy-JY

Here is my input data: {"text":"<__dj__image>", "images":["/home/dj_test_images/7060.png"]}

And here is the config file I was using when the problem above occurred:

# Process config example for dataset

# global parameters
project_name: 'demo-process'
dataset_path: './demos/data/demo-dataset-image.jsonl'  # path to your dataset directory or file
np: 1  # number of subprocess to process your dataset

export_path: './outputs/demo-process/demo-processed.jsonl'
text_keys: 'text'
image_key: 'images'
image_special_token: '<__dj__image>'

# process schedule
# a list of several process operators with their arguments
process:
  - image_captioning_mapper:                             
      hf_img2seq: '/home/InternVL2/InternVL2_5-2B'             
      caption_num: 1                               
      keep_candidate_mode: 'random_any'         
      keep_original_sample: true                            
      prompt: "describe the image"                                        
      prompt_key: null                                       
      mem_required: '16GB'
      trust_remote_code: true

Feb 28 '25 08:02 Crazy-JY

In addition, with the input unchanged, I often run into the following situation.

There are no errors or warnings, but nothing is written to ./outputs/demo-process/demo-processed.jsonl either. I wonder whether the output format or the output path is unset or set incorrectly. The log is as follows:

2025-02-28 09:31:03 | INFO | data_juicer.core.executor:52 - Using cache compression method: [None]
2025-02-28 09:31:03 | INFO | data_juicer.core.executor:57 - Setting up data formatter...
2025-02-28 09:31:03 | INFO | data_juicer.core.executor:80 - Preparing exporter...
2025-02-28 09:31:03 | INFO | data_juicer.core.executor:160 - Loading dataset from data formatter...
2025-02-28 09:31:04 | INFO | data_juicer.format.formatter:185 - Unifying the input dataset formats...
2025-02-28 09:31:04 | INFO | data_juicer.format.formatter:200 - There are 1 sample(s) in the original dataset.
2025-02-28 09:31:04 | INFO | data_juicer.format.formatter:214 - 1 samples left after filtering empty text.
2025-02-28 09:31:04 | INFO | data_juicer.format.formatter:237 - Converting relative paths in the dataset to their absolute version. (Based on the directory of input dataset file)
2025-02-28 09:31:04 | INFO | data_juicer.format.mixture_formatter:137 - sampled 1 from 1
2025-02-28 09:31:04 | INFO | data_juicer.format.mixture_formatter:143 - There are 1 in final dataset
2025-02-28 09:31:04 | INFO | data_juicer.core.executor:166 - Preparing process operators...
2025-02-28 09:31:04 | INFO | data_juicer.core.executor:194 - Processing data...
2025-02-28 09:31:05 | INFO | data_juicer.core.data:226 - [1/1] OP [image_captioning_mapper] Done in 0.817s. Left 0 samples.
2025-02-28 09:31:07 | INFO | data_juicer.utils.logger_utils:227 - Processing finished with: Warnings: 0 Errors: 0
Error/Warning details can be found in the log file [/data-juicer/outputs/demo-process/log/export_demo-processed.jsonl_time_20250228093051.txt] and its related log files.
2025-02-28 09:31:07 | INFO | data_juicer.core.executor:206 - All OPs are done in 2.469s.
2025-02-28 09:31:07 | INFO | data_juicer.core.executor:209 - Exporting dataset to disk...
2025-02-28 09:31:07 | INFO | data_juicer.core.exporter:111 - Exporting computed stats into a single file...
2025-02-28 09:31:07 | INFO | data_juicer.core.exporter:146 - Export dataset into a single file...
Creating json from Arrow format: 0ba [00:00, ?ba/s]

Feb 28 '25 09:02 Crazy-JY

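By default, the image_captioning_mapper operator supports models like BLIP-2. VLMs such as the InternVL2_5-2B you are using come with their own tokenization and their own generate or chat interfaces, so they do not fit this operator's implementation well. I suggest implementing a new operator based on this operator's implementation and the InternVL2 usage examples.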
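As a rough starting point, the per-sample logic of such a custom operator might look like the sketch below. It is framework-independent and does not use Data-Juicer's operator API or the InternVL interface: `caption_sample` and `placeholder_caption_fn` are hypothetical names, `caption_fn` stands for whatever wrapper you write around the model's own chat or generate call, and the layout of the output text (image token followed by the caption) is an assumption. The `prompt` and `keep_original_sample` parameters mirror the config shown earlier in the thread.

```python
from typing import Callable, Dict, List

IMAGE_TOKEN = "<__dj__image>"  # image placeholder used in this thread


def caption_sample(
    sample: Dict,
    caption_fn: Callable[[str, str], str],
    prompt: str = "describe the image",
    keep_original_sample: bool = True,
) -> List[Dict]:
    """Caption every image of one sample via caption_fn(image_path, prompt)."""
    captions = [caption_fn(path, prompt) for path in sample["images"]]
    # Assumed layout: each image token is followed by that image's caption.
    text = " ".join(f"{IMAGE_TOKEN} {caption}" for caption in captions)
    captioned = {**sample, "text": text}
    return [sample, captioned] if keep_original_sample else [captioned]


def placeholder_caption_fn(image_path: str, prompt: str) -> str:
    """Stand-in for a real model call; replace with your VLM wrapper."""
    return f"[caption of {image_path}]"
```

Keeping the model call behind a plain callable makes the sample handling testable without loading the model; the real operator would still need to be registered and wired into Data-Juicer following the existing operator's code.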

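The later runs produced no output because they reused the cache from the first, failed run. While testing, you can set use_cache: false in the config file to turn the cache off, and turn it back on for large-scale data processing.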
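If you switch between test runs and full runs often, the flag can be toggled with a few lines of Python. This is only a sketch: `set_use_cache` is a hypothetical helper that edits the config as plain text (so it needs no YAML library), and it assumes `use_cache` is a top-level key written at the start of a line.

```python
import re


def set_use_cache(config_text: str, enabled: bool) -> str:
    """Return config_text with the top-level `use_cache` key set to `enabled`."""
    value = "true" if enabled else "false"
    pattern = re.compile(r"^use_cache:.*$", flags=re.MULTILINE)
    if pattern.search(config_text):
        # Key already present: overwrite its value (a trailing comment is dropped).
        return pattern.sub(f"use_cache: {value}", config_text)
    # Key absent: append it as a new top-level entry.
    sep = "" if config_text.endswith("\n") or not config_text else "\n"
    return f"{config_text}{sep}use_cache: {value}\n"
```

Reading the config file, passing its text through `set_use_cache(text, False)`, and writing the result back gives a config suitable for testing; calling it with `True` restores caching.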

Feb 28 '25 09:02 HYLcool

OK, got it. Thanks a lot!

Feb 28 '25 09:02 Crazy-JY

Close this stale issue.

May 06 '25 02:05 HYLcool
