img2dataset
Duplicate images in MS COCO
In MSCOCO or Visual Genome, an image can have more than one caption, so its URL appears in several rows and img2dataset downloads it 3 or 4 times. How can I solve this problem?
img2dataset --url_list source/mscoco.parquet --input_format "parquet" \
  --url_col "URL" --caption_col "TEXT" --output_format files \
  --output_folder source/mscoco --processes_count 8 --thread_count 16 --image_size 224 \
  --enable_wandb False
Hi, you could join the captions into a single column and then use the save_additional_columns option to put them in the JSON file next to each image.
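The suggestion above could be sketched roughly as follows. This is only an illustration, not part of img2dataset: the helper `merge_captions`, the `ALL_CAPTIONS` column name, the `" | "` separator, and the output path `source/mscoco_dedup.parquet` are all assumptions made for the example. It keeps the first caption in `TEXT` and collects every caption for the same URL into one extra column, so each URL appears only once.

```python
def merge_captions(rows, sep=" | "):
    """Collapse (url, caption) pairs into one record per URL.

    Hypothetical helper: the first caption stays in TEXT and all captions
    are joined into ALL_CAPTIONS (an assumed column name).
    """
    by_url = {}  # dicts keep insertion order, so URLs stay in first-seen order
    for url, caption in rows:
        by_url.setdefault(url, []).append(str(caption))
    return [
        {"URL": url, "TEXT": caps[0], "ALL_CAPTIONS": sep.join(caps)}
        for url, caps in by_url.items()
    ]


if __name__ == "__main__":
    # Requires pandas with a parquet engine such as pyarrow.
    import pandas as pd

    df = pd.read_parquet("source/mscoco.parquet")
    merged = pd.DataFrame(merge_captions(zip(df["URL"], df["TEXT"])))
    # Assumed output path; pick whatever location suits you.
    merged.to_parquet("source/mscoco_dedup.parquet", index=False)
```

With a file like this, the original command would presumably point --url_list at the deduplicated parquet and add something like --save_additional_columns '["ALL_CAPTIONS"]' so the joined captions end up in the JSON file next to each image.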
@rom1504 thanks for your reply. So downloading an image multiple times for MSCOCO is the default behavior?
@tungdop2 it seems the issue is in this metadata parquet file. Do you want to fix it and upload a better version to Hugging Face?