bing_image_downloader
Duplicate Images
I am trying to create a food dataset. However, when I try to scrape from Bing using this library, I am getting a lot of duplicate images. Please assist.
Thank you
My first attempt at filtering out duplicates would be to subtract two possibly duplicated images pixel by pixel and check whether the difference is close to zero.
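A minimal sketch of that subtraction idea, in case it helps. It assumes both images have already been decoded to flat sequences of pixel values with the same dimensions (for example via an imaging library, which is not shown here). The function names and the `tolerance` default are hypothetical, picked only for illustration.

```python
def mean_abs_difference(pixels_a, pixels_b):
    """Return the mean absolute per-value difference, or None if sizes differ."""
    if len(pixels_a) != len(pixels_b) or not pixels_a:
        return None
    total = sum(abs(a - b) for a, b in zip(pixels_a, pixels_b))
    return total / len(pixels_a)


def is_probable_duplicate(pixels_a, pixels_b, tolerance=2.0):
    """Treat two images as duplicates when their mean difference is near zero.

    `tolerance` is an arbitrary illustrative threshold on a 0-255 scale.
    """
    diff = mean_abs_difference(pixels_a, pixels_b)
    return diff is not None and diff <= tolerance
```

Comparing every pair this way grows quadratically with the number of images, so it is probably only practical as a cleanup pass on a small dataset.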
I'm getting the same. I downloaded 10000 pictures and 9789 of them were duplicates. Is this inherent to Bing image search, or specific to this downloader?
When I scrape 100 photos, after the first 85 to 90 images, they start to repeat, and the rest are all duplicates. When I scrape 500 photos, 370 of them are duplicates :( Other than this it works great, so I really hope this issue can get fixed.
Yes, I faced the same issue. It is caused by how the downloader is programmed: Bing has no "next page", so instead of setting first=pagecounter, set first to the total number of URLs visited so far. I also added a check that ignores a URL if it has already been visited. I will open a pull request with the code, or you can visit https://github.com/AbhiDhariwal/bing_image_downloader
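Roughly, the offset idea could look like the sketch below. This is not the code from the fork. `fetch_page` is a hypothetical callable that takes the `first` offset and returns the list of image links on that results page, and `max_stale_pages` is an assumed safeguard so the loop stops once pages only return links that were already seen.

```python
def collect_unique_links(fetch_page, limit, max_stale_pages=3):
    """Collect up to `limit` distinct links, advancing `first` by links visited."""
    seen = set()
    ordered = []
    offset = 0        # value passed as `first`: number of links visited so far
    stale_pages = 0   # consecutive pages that contributed nothing new
    while len(ordered) < limit and stale_pages < max_stale_pages:
        links = fetch_page(offset)
        if not links:
            break
        new_on_page = 0
        for link in links:
            if link in seen:
                continue  # ignore a URL that was already visited
            seen.add(link)
            ordered.append(link)
            new_on_page += 1
            if len(ordered) == limit:
                break
        offset += len(links)
        stale_pages = 0 if new_on_page else stale_pages + 1
    return ordered
```

Giving up after a few pages with nothing new means the result can be shorter than `limit`, but the search terminates when the results start repeating.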
I successfully avoided duplicated images with the following code. But now it will search forever. So yeah, maybe we need a next button for more images.
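```python
import imghdr
import urllib.request

# In __init__: raw bytes of every image saved so far.
self.duplicates = set()

def save_image(self, link, file_path):
    request = urllib.request.Request(link, None, self.headers)
    image = urllib.request.urlopen(request, timeout=self.timeout).read()
    # Reject data that is not an image, or whose bytes were already saved.
    if not imghdr.what(None, image) or image in self.duplicates:
        print('[Error]Invalid image, not saving {}\n'.format(link))
        # A bare `raise` has no active exception here, so raise explicitly.
        raise ValueError('Invalid or duplicate image: {}'.format(link))
    self.duplicates.add(image)
    with open(file_path, 'wb') as f:
        f.write(image)
```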
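Keeping the full bytes of every image in a set can use a lot of memory on large scrapes. A possible variant, sketched here under the assumption that byte-identical files are the duplicates being targeted, stores a SHA-256 digest of each image instead. `DuplicateFilter` and `is_new` are hypothetical names and not part of the library.

```python
import hashlib


class DuplicateFilter:
    """Remember a digest of each payload seen so far instead of the raw bytes."""

    def __init__(self):
        self._digests = set()

    def is_new(self, data):
        """Return True the first time `data` is seen, False on any repeat."""
        digest = hashlib.sha256(data).digest()
        if digest in self._digests:
            return False
        self._digests.add(digest)
        return True
```

Inside `save_image`, the `image in self.duplicates` test could then become `not self.duplicate_filter.is_new(image)`, where `duplicate_filter` is an assumed attribute created in `__init__`. Like the original check, this only catches exact copies, not resized or re-encoded versions of the same picture.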
Remove duplicates PR#20
Bumping this as an issue. The fix above looks like it works and would be great if merged. Thanks!
Please close this issue