bing_image_downloader

Duplicate Images

Open ansariyusuf opened this issue 4 years ago • 8 comments

I am trying to create a food dataset. However, when I try to scrape from Bing using this library, I am getting a lot of duplicate images. Please assist.

Thank you

ansariyusuf avatar Oct 09 '20 11:10 ansariyusuf

My first attempt at filtering out duplicates would be to subtract two candidate duplicate images pixel by pixel and check whether the difference is close to zero.
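A rough sketch of that subtraction idea, in case it helps. It assumes both images have already been decoded, resized to the same dimensions (for example with Pillow, not shown here) and flattened into sequences of 0-255 channel values. The function names and the `tolerance` value are hypothetical and not part of bing_image_downloader.

```python
from typing import Sequence


def mean_abs_difference(a: Sequence[int], b: Sequence[int]) -> float:
    """Mean absolute per-channel difference between two pixel buffers."""
    if len(a) != len(b):
        raise ValueError("buffers must have the same length")
    if not a:
        return 0.0
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def looks_duplicate(a: Sequence[int], b: Sequence[int],
                    tolerance: float = 2.0) -> bool:
    """Treat two buffers as duplicates if their mean difference is tiny."""
    # Buffers of different sizes cannot be subtracted, so call them distinct.
    if len(a) != len(b):
        return False
    return mean_abs_difference(a, b) <= tolerance
```

A small tolerance rather than an exact zero allows for re-encoded copies of the same picture, but the right threshold would have to be found by experiment.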

NickT5 avatar Oct 18 '20 16:10 NickT5

I'm getting the same. I downloaded 10000 pictures and 9789 of them were duplicates. Is this inherent to Bing image search, or specific to this downloader?

atsbomb avatar Dec 06 '20 01:12 atsbomb

When I scrape 100 photos, after the first 85 to 90 images, they start to repeat, and the rest are all duplicates. When I scrape 500 photos, 370 of them are duplicates :( Other than this it works great, so I really hope this issue can get fixed.

jane-cz avatar Jan 24 '21 02:01 jane-cz

Yeah, I also faced the same issue. It is caused by how the downloader is programmed: there is no "next page" in Bing, so instead of setting first to the page counter, first should be set to the total number of URLs visited so far. I also added a check that ignores a URL if it has already been visited. I will also open a pull request with the code, or you can visit https://github.com/AbhiDhariwal/bing_image_downloader
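Roughly, the idea looks like the sketch below. This is not the code from the fork: `fetch_page` is a hypothetical callable standing in for whatever requests a Bing results page at a given `first` offset and returns the image URLs on it, and using the count of distinct URLs as the offset is an assumption.

```python
from typing import Callable, List


def collect_unique_links(fetch_page: Callable[[int], List[str]],
                         limit: int) -> List[str]:
    """Gather up to `limit` distinct URLs, paging by URLs seen so far."""
    seen = set()
    links: List[str] = []
    while len(links) < limit:
        # Offset by the number of distinct URLs visited, not a page counter.
        page = fetch_page(len(seen))
        if not page:
            break
        new_on_page = 0
        for url in page:
            if url in seen:
                continue  # ignore URLs that were already visited
            seen.add(url)
            links.append(url)
            new_on_page += 1
            if len(links) == limit:
                break
        # A page with nothing new means the results are exhausted.
        if new_on_page == 0:
            break
    return links
```

Stopping on the first page that yields no new URL keeps the loop from running forever once the results start repeating.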

AbhiDhariwal avatar Feb 03 '21 20:02 AbhiDhariwal

I successfully avoided duplicate images with the following code. But now the search never terminates. So yeah, maybe we need a next button for more images.

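```python
import imghdr  # deprecated since Python 3.11 and removed in 3.13
import urllib.request

# In the downloader's __init__: raw bytes of every image saved so far.
self.duplicates = set()

def save_image(self, link, file_path):
    request = urllib.request.Request(link, None, self.headers)
    image = urllib.request.urlopen(request, timeout=self.timeout).read()

    # Reject anything that is not a recognised image or was already saved.
    if not imghdr.what(None, image) or image in self.duplicates:
        print('[Error]Invalid or duplicate image, not saving {}\n'.format(link))
        # A bare `raise` has no active exception to re-raise here.
        raise ValueError('Invalid or duplicate image: {}'.format(link))
    self.duplicates.add(image)

    with open(file_path, 'wb') as f:
        f.write(image)
```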
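One possible way around the endless search, as a sketch only: compare SHA-256 digests instead of whole images, and give up after a run of consecutive duplicates. The function `take_unique` and its `max_consecutive_duplicates` parameter are hypothetical names for illustration, and the input is assumed to be an iterable of already downloaded image bytes.

```python
import hashlib
from typing import Iterable, List


def take_unique(blobs: Iterable[bytes], limit: int,
                max_consecutive_duplicates: int = 20) -> List[bytes]:
    """Keep distinct blobs until `limit` is reached or duplicates pile up."""
    seen_digests = set()
    unique: List[bytes] = []
    streak = 0  # duplicates seen in a row since the last new image
    for blob in blobs:
        # Store a 32-byte digest rather than the full image bytes.
        digest = hashlib.sha256(blob).digest()
        if digest in seen_digests:
            streak += 1
            if streak >= max_consecutive_duplicates:
                break  # the source has most likely run out of new images
            continue
        streak = 0
        seen_digests.add(digest)
        unique.append(blob)
        if len(unique) >= limit:
            break
    return unique
```

This only catches byte-identical files; the same picture served at a different size or compression level would still get through.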

shoppel avatar Apr 01 '21 16:04 shoppel

Remove duplicates PR#20

sid7631 avatar Sep 18 '21 15:09 sid7631

Bumping this issue. The fix above looks like it works, and it would be great if it were merged. Thanks!

annabaringer avatar Dec 31 '21 20:12 annabaringer

Please close this issue

sid7631 avatar Mar 14 '22 16:03 sid7631