
Duplicate images in MS COCO

Open tungdop2 opened this issue 2 years ago • 3 comments

In MSCOCO or Visual Genome, an image has more than one caption, so its URL appears in several rows and img2dataset downloads it 3 or 4 times. How can I avoid this?

img2dataset --url_list source/mscoco.parquet --input_format "parquet" \
    --url_col "URL" --caption_col "TEXT" --output_format files \
    --output_folder source/mscoco --processes_count 8 --thread_count 16 --image_size 224 \
    --enable_wandb False

tungdop2, Dec 16 '22 10:12

Hi, you could join the captions for each image into a single column and then use the save_additional_columns option to put them in the json file next to each image.
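One possible way to do the joining before running img2dataset is sketched below. It assumes the parquet file has the "URL" and "TEXT" columns used in the command above; the column name ALL_CAPTIONS, the " | " separator, and the helper names are made up for illustration and are not part of img2dataset.

```python
def group_captions(rows):
    """Map each URL to the list of its captions, keeping first-seen order."""
    grouped = {}
    for url, caption in rows:
        grouped.setdefault(url, []).append(str(caption))
    return grouped


def dedupe_parquet(src, dst, url_col="URL", caption_col="TEXT",
                   joined_col="ALL_CAPTIONS", sep=" | "):
    """Write a copy of `src` that has exactly one row per URL.

    `joined_col` and `sep` are illustrative choices, not img2dataset defaults.
    Returns (rows_before, rows_after).
    """
    import pandas as pd  # also needs a parquet engine such as pyarrow

    df = pd.read_parquet(src, columns=[url_col, caption_col])
    grouped = group_captions(zip(df[url_col], df[caption_col]))
    out = pd.DataFrame({
        url_col: list(grouped),
        # Keep the first caption as the main caption column.
        caption_col: [caps[0] for caps in grouped.values()],
        # Store every caption for the image in one joined string.
        joined_col: [sep.join(caps) for caps in grouped.values()],
    })
    out.to_parquet(dst, index=False)
    return len(df), len(out)
```

The resulting file could then be passed to --url_list in place of source/mscoco.parquet, with the joined column (ALL_CAPTIONS in this sketch) listed in --save_additional_columns so that all captions end up in the json file for each image.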

rom1504, Dec 17 '22 00:12

@rom1504 thanks for your reply. So is downloading each image multiple times the default behavior for MSCOCO?

tungdop2, Dec 19 '22 01:12

@tungdop2 it seems this metadata parquet file has this issue. Do you want to fix it and upload a better version to Hugging Face?

rom1504, Dec 19 '22 15:12