On LAION-400M, found 450K corrupted images

Open visualdatabase opened this issue 3 years ago • 9 comments

Attached example images: 0_outtmplaiondatasets24565 tar245657488, 1_outtmplaiondatasets08475 tar084751105, 2_outtmplaiondatasets15564 tar155644158, 3_outtmplaiondatasets07973 tar079738057

Has anyone else experienced this issue? The download and resize steps (to size 224x224) finished without errors.

This is the command I used:

img2dataset --url_list older --input_format "parquet"\
         --url_col "URL" --caption_col "TEXT" --output_format webdataset\
           --output_folder your_output_folder5 --processes_count 64 --thread_count 32 --image_size 224\
             --enable_wandb False

The resulting image files are between 1000 and 3000 bytes, with 1411 bytes being a very frequent size.

visualdatabase avatar Nov 22 '22 09:11 visualdatabase

Can you check whether it's a problem with the source (that would be more or less normal) or a problem with the tool? For example, try to find the URL of one such image.

rom1504 avatar Nov 22 '22 12:11 rom1504

I would love to test that, but I am not sure how to link the metadata from the parquet to the numbering of the image and tar names. The numbering seems to be consecutive, and images that failed during download are skipped. Is there any documentation on that?

dbickson avatar Nov 22 '22 13:11 dbickson

If you use the save_additional_columns option when downloading (see the examples), then the URL will be contained in the .json file next to each image in the tar.

rom1504 avatar Nov 22 '22 16:11 rom1504

Actually, the URL gets saved even without that option.

rom1504 avatar Nov 22 '22 16:11 rom1504
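
A minimal sketch of that lookup, for anyone following along. It assumes each sample in a shard is stored as `<key>.jpg` plus a `<key>.json` sidecar and that the JSON has a `url` field. The function name `find_small_images` and the 3000-byte threshold are illustrative and not part of img2dataset.

```python
import json
import tarfile


def find_small_images(tar_path, max_bytes=3000):
    """Return (key, size_in_bytes, url) for every image smaller than max_bytes."""
    sizes = {}  # key -> image size in bytes
    urls = {}   # key -> url read from the .json sidecar
    with tarfile.open(tar_path) as tar:
        for member in tar:
            if not member.isfile():
                continue
            key, _, ext = member.name.rpartition(".")
            if ext == "json":
                meta = json.load(tar.extractfile(member))
                urls[key] = meta.get("url")  # assumed field name
            elif ext in ("jpg", "png", "webp"):
                sizes[key] = member.size
    return [(k, s, urls.get(k)) for k, s in sorted(sizes.items()) if s < max_bytes]
```

Calling `find_small_images("39876.tar")` on a downloaded shard would list the suspiciously small files together with their source URLs, which can then be opened in a browser to compare against the originals.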

Hi @rom1504, thanks for your guidance. I was able to find the URLs and confirm that the original images were fine. Here are some examples of original images that got messed up:

2184,pois_rouge_fond_orange,http://s3.amazonaws.com/spoonflower/public/design_thumbnails/0052/7847/rrpois_rouge_fond_orange_shop_preview.png,398762169,success,,224.0,224.0,470.0,403.0,{},372694e5ab0eee9ea7a16491745c061f
2293,The Big Lebowski,http://images5.fanpop.com/image/photos/30900000/The-Big-Lebowski-the-big-lebowski-30926827-357-500.png,398762299,success,,224.0,224.0,357.0,500.0,{},e3cd92ef2df679f424e1b87428db4d02
2614,FEBRUARY_SPRING_SERIES_SPOONFLOWER,http://s3.amazonaws.com/spoonflower/public/design_thumbnails/0183/4917/rFEBRUARY_SPRING_SERIES_SPOONFLOWER_shop_preview.png,398762616,success,,224.0,224.0,470.0,403.0,{},79efbe531ab6c9315d6a27a6eeeb7360
3061,Catelyn Tully Stark - catelyn-tully-stark fan art,http://images6.fanpop.com/image/photos/33700000/Catelyn-Tully-Stark-catelyn-tully-stark-33727457-89-120.png,398763067,success,,224.0,224.0,89.0,120.0,{},54dc4776033bde2ddd6a17c87ca6fdf2
6570,The Austrian Ocean Race Project,https://i0.wp.com/ocean-racing.at/wp-content/uploads/2020/06/Logo_OceanRacing_CLAIM-NEU_NEG.png?fit=827%2C463&ssl=1,398766525,success,,224.0,224.0,827.0,463.0,{},4a801705ba1722142d3e0385749a880b
6589,"Logo Sabine Reitmayer-Wawer Business Coaching, Sparring und Facilitation",https://www.wawer.at/wp-content/uploads/2020/07/Sabine-Reitmayer-Wawer-LOGO-Wei%C3%9F-auf-transparent-Coaching-Sparring-Facilitation-600x200.png,398766578,success,,224.0,224.0,600.0,200.0,{},fc8d4126d4c50a552bdafcf35f68d391
7328,"Pro Safety & Rescue, Inc.",https://cdn.shopify.com/s/files/1/0739/7233/t/23/assets/logo.png?1988037197299676201,398767307,success,,224.0,224.0,560.0,170.0,{},6024c603493e923a5f6fcea199bc2db8
7396,Graphing Impact and Urgency of tasks,https://mattheweppelsheimer.com/wp-content/uploads/2014/12/Value-Quadrants-Diagram-2-Impact-and-Urgency-1024x768.png,398767411,success,,224.0,224.0,1024.0,768.0,{},01afe84329fd0a6a8c9f19a6bb2f197f
7657,rainbow green fabric by fabricfaeries on Spoonflower - custom fabric,http://s3.amazonaws.com/spoonflower/public/design_thumbnails/0129/3756/rrrainbow_shop_preview.png,398767664,success,,224.0,224.0,470.0,403.0,{},2e70acd73e829b5ece85e751a9c1cb62

For example: the link is http://studybibleforwomen.com/wp-content/themes/studybibleforwomen/images/study-bible-for-women.png, the key is 398761167, the tar is 39876, and the image is 398761167.jpg.

Image after resize (to 224x224) is only 2189 bytes.

The original image details:

file study-bible-for-women.png
study-bible-for-women.png: PNG image data, 144 x 121, 8-bit/color RGBA, non-interlaced

The original image was 6411 bytes. It seems the resize somehow corrupted the content of the image.

Attached images: 398761167, study-bible-for-women

dbickson avatar Nov 23 '22 16:11 dbickson
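
The example in the comment above (key 398761167, tar 39876) suggests that the key encodes its own location: the leading digits name the shard and the trailing digits are the position inside it. A minimal sketch of that split, assuming the last four digits are the in-shard index. That matches the example here but would differ with another shard size, and `split_key` is a hypothetical helper, not an img2dataset function.

```python
def split_key(key, index_digits=4):
    """Split a sample key into (shard_name, index_in_shard).

    index_digits=4 is an assumption inferred from the example in this thread.
    """
    key = str(key)
    return key[:-index_digits], int(key[-index_digits:])


shard, index = split_key("398761167")
print(f"{shard}.tar, sample {index}")  # prints "39876.tar, sample 1167"
```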

Note that the image is RGBA. Maybe that is the source of the issue?

dbickson avatar Nov 23 '22 16:11 dbickson
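
To illustrate why an alpha channel could matter here (a hypothesis in this thread, not a confirmed diagnosis): JPEG has no alpha channel, so an RGBA image has to be flattened before encoding. If the alpha values are simply dropped, fully transparent pixels show whatever RGB values happen to be stored under them, which is often black. The sketch below contrasts that with compositing over a white background. It works on plain tuples and is not img2dataset's actual code; both function names are made up for the example.

```python
def drop_alpha(pixel):
    """Discard the alpha value and keep the stored RGB as-is."""
    r, g, b, _a = pixel
    return (r, g, b)


def composite_over(pixel, background=(255, 255, 255)):
    """Blend an 8-bit RGBA pixel over an opaque background color."""
    r, g, b, a = pixel
    alpha = a / 255.0
    return tuple(
        round(alpha * fg + (1.0 - alpha) * bg)
        for fg, bg in zip((r, g, b), background)
    )


transparent_black = (0, 0, 0, 0)   # invisible in the PNG
opaque_red = (200, 30, 30, 255)    # fully visible in the PNG

print(drop_alpha(transparent_black))      # (0, 0, 0): turns black
print(composite_over(transparent_black))  # (255, 255, 255): stays white
print(composite_over(opaque_red))         # (200, 30, 30): unchanged
```

A logo drawn on a transparent background would lose most of its visible content under the first approach, which would also fit the very small output files reported earlier.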

Could be, yeah. Do you feel like adding a few such images to https://github.com/rom1504/img2dataset/tree/main/tests/resize_test_image and running the dedicated unit test for the resizer? It will be easier to reproduce and understand there.

rom1504 avatar Nov 23 '22 18:11 rom1504

Hi @rom1504, I ran the unit test after changing the code as shown in the diff below. I put the attached image into the resize test image folder and removed everything else from there. No error is returned, but the resulting image is messed up. I suspect it is related to the RGBA channel, the image size, or both.

@@ -67,16 +67,18 @@ def test_resizer_filter():
     test_folder = current_folder + "/" + "resize_test_image"
     image_paths = glob.glob(test_folder + "/*")
     resizer = Resizer(
-        image_size=256, resize_mode="no", resize_only_if_bigger=True, min_image_size=200, max_aspect_ratio=1.5
+        image_size=224, resize_mode="keep_ratio", resize_only_if_bigger=False, min_image_size=2, max_aspect_ratio=200, encode_format='jpg'
     )
     errors = []
     for image_path in image_paths:
         with open(image_path, "rb") as f:
             img = f.read()
             image_original_stream = io.BytesIO(img)
-        _, _, _, _, _, err = resizer(image_original_stream)
+        img, _, _, _, _, err = resizer(image_original_stream)
+        with open("stam.jpg", "wb") as f:
+            f.write(img)
         errors.append(err)
-    expected_errors = [(None, 2), ("image too small", 2), ("aspect ratio too large", 3)]
+    expected_errors = [(None, 1)]
     for expected_error, count in expected_errors:
         assert count == errors.count(expected_error)

Here is the image that is not resized properly: study-bible-for-women (attached).

The resulting file has the right dimensions but messed-up content:

azureuser@ubuntu-16tb:/mount/img2dataset$ file stam.jpg
stam.jpg: JPEG image data, JFIF standard 1.01, aspect ratio, density 1x1, segment length 16, baseline, precision 8, 267x224, components 3

dbickson avatar Nov 26 '22 08:11 dbickson
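
The 267x224 output reported above is consistent with the test's `resize_mode="keep_ratio"` setting, which is documented as scaling the smaller side to `image_size` while preserving the aspect ratio. A quick check of the arithmetic for the 144x121 original, assuming the longer side is rounded to the nearest pixel (the helper `keep_ratio_size` is illustrative only):

```python
def keep_ratio_size(width, height, image_size):
    """Scale so the smaller side equals image_size, preserving aspect ratio."""
    scale = image_size / min(width, height)
    return round(width * scale), round(height * scale)


print(keep_ratio_size(144, 121, 224))  # (267, 224)
```

So the geometry of the resize looks correct, and the problem is confined to the pixel content.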

Interested in a fix.

rom1504 avatar Dec 21 '22 04:12 rom1504