bing_image_downloader Duplicate Images

I am trying to create a food dataset. However, when I try to scrape from Bing using this library, I am getting a lot of duplicate images. Please assist.

Thank you

Oct 09 '20 11:10 ansariyusuf

My first attempt to filter out duplicates would be to subtract one suspected duplicate from the other, pixel by pixel, and check whether the difference is close to zero.
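A minimal sketch of that check, assuming both images have already been decoded into flat sequences of pixel values of the same size (the decoding step, for example with an image library, is not shown). The function names and the `tolerance` default are hypothetical and only for illustration.

```python
from typing import Sequence


def mean_abs_difference(a: Sequence[int], b: Sequence[int]) -> float:
    """Return the mean absolute per-pixel difference of two images."""
    if len(a) != len(b):
        raise ValueError("images must have the same number of pixel values")
    if not a:
        return 0.0
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def is_near_duplicate(a: Sequence[int], b: Sequence[int],
                      tolerance: float = 2.0) -> bool:
    """Treat two images as duplicates if their difference is close to zero.

    `tolerance` is an assumed threshold; images of different sizes are
    simply reported as not duplicates.
    """
    if len(a) != len(b):
        return False
    return mean_abs_difference(a, b) <= tolerance
```

Comparing every pair this way is quadratic in the number of images, so it is probably only practical for small downloads.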

Oct 18 '20 16:10 NickT5

I'm getting the same. I downloaded 10000 pictures and 9789 of them were duplicates. Is this inherent to Bing image search, or specific to this downloader?

Dec 06 '20 01:12 atsbomb

When I scrape 100 photos, after the first 85 to 90 images, they start to repeat, and the rest are all duplicates. When I scrape 500 photos, 370 of them are duplicates :( Other than this it works great, so I really hope this issue can get fixed.

Jan 24 '21 02:01 jane-cz

Yes, I also faced the same issue. It is caused by how the downloader is programmed: Bing has no "next page", so instead of `first=pagecounter`, `first` should be the total number of URLs visited so far. I also added a check that ignores a URL if it has already been visited. I will also open a pull request with the code, or you can visit https://github.com/AbhiDhariwal/bing_image_downloader
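A rough sketch of that idea, not the code from the fork: `fetch_page` is a hypothetical callable that takes the `first` offset and returns the image URLs Bing gives back for it, and `collect_unique_links` is a made-up name. Stopping when a page contains nothing new is an added assumption, so the loop cannot run forever.

```python
from typing import Callable, List


def collect_unique_links(fetch_page: Callable[[int], List[str]],
                         limit: int) -> List[str]:
    """Collect up to `limit` unique URLs using an offset instead of a page counter."""
    seen = set()   # URLs already visited
    links = []     # unique URLs, in the order they were found
    visited = 0    # total URLs visited so far; used as the next `first` offset

    while len(links) < limit:
        page = fetch_page(visited)
        if not page:
            break  # no more results
        new_on_page = 0
        for url in page:
            visited += 1
            if url in seen:
                continue  # ignore a URL that was already visited
            seen.add(url)
            links.append(url)
            new_on_page += 1
            if len(links) == limit:
                break
        if new_on_page == 0:
            break  # only repeats came back, so stop instead of looping forever
    return links
```

For example, with a stand-in `fetch_page` that returns three results per request from a fixed list, `collect_unique_links(fetch_page, 5)` requests offsets 0 and 3 and returns the first five URLs.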

Feb 03 '21 20:02 AbhiDhariwal

I successfully avoided duplicate images with the following code, but now the search runs forever. So yeah, maybe we need a "next" button to get more images.

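```python
import imghdr
import urllib.request

# In the downloader's __init__: remember the bytes of every image saved so far.
self.duplicates = set()

def save_image(self, link, file_path):
    request = urllib.request.Request(link, None, self.headers)
    image = urllib.request.urlopen(request, timeout=self.timeout).read()

    # Reject anything that is not a recognizable image or was already saved.
    if not imghdr.what(None, image) or image in self.duplicates:
        print('[Error]Invalid or duplicate image, not saving {}\n'.format(link))
        # A bare `raise` has no active exception here, so raise one explicitly.
        raise ValueError('Invalid or duplicate image: {}'.format(link))
    self.duplicates.add(image)

    with open(file_path, 'wb') as f:
        f.write(image)
```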
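The snippet above keeps the full bytes of every image in the set, which can use a lot of memory on large downloads. A possible variation, as a sketch only, is to store a SHA-256 digest of each image instead. The `SeenImages` class and its `add_if_new` method are hypothetical names and are not part of bing_image_downloader.

```python
import hashlib


class SeenImages:
    """Track images by content digest rather than by their full bytes."""

    def __init__(self):
        self._digests = set()

    def add_if_new(self, image: bytes) -> bool:
        """Return True and record the image if its content was not seen before."""
        digest = hashlib.sha256(image).digest()
        if digest in self._digests:
            return False
        self._digests.add(digest)
        return True
```

Like the original check, this only catches byte-for-byte identical files; the same picture re-encoded or resized would still get through.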

Apr 01 '21 16:04 shoppel

Remove duplicates PR#20

Sep 18 '21 15:09 sid7631

Bumping this as an issue. The fix above looks like it works and would be great if merged. Thanks!

Dec 31 '21 20:12 annabaringer

Please close this issue

Mar 14 '22 16:03 sid7631
