Google Crawler can only get around 100 images instead of 1000
Hi, when I use the search URLs generated by the `feed()` function in `GoogleFeeder`, I only get around 100 images even though `max_num=1000`. I find that all of the URLs return the same 100 results as the first URL, so it seems the `ijn` and `start` params no longer have any effect. I just want to get close to 1000 images per keyword. Does anybody have a solution?
```python
def feed(self, keyword, offset, max_num, language=None, filters=None):
    base_url = 'https://www.google.com/search?'
    self.filter = self.get_filter()
    filter_str = self.filter.apply(filters, sep=',')
    for i in range(offset, offset + max_num, 100):
        params = dict(
            q=keyword,
            ijn=int(i / 100),
            start=i,
            tbs=filter_str,
            tbm='isch')
        if language:
            params['lr'] = 'lang_' + language
        url = base_url + urlencode(params)
        self.out_queue.put({'url': url, 'keyword': keyword, 'next_offset': i + 100})
        self.logger.debug('put url to url_queue: {}'.format(url))
```
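For reference, here is a standalone sketch of the URL scheme that loop produces (the keyword `car` and the `offset=0, max_num=300` values are just examples):

```python
from urllib.parse import urlencode

# Reproduce feed()'s pagination outside the class: one URL per batch of
# 100 results, stepping ijn and start together.
base_url = 'https://www.google.com/search?'
for i in range(0, 300, 100):
    params = dict(q='car', ijn=i // 100, start=i, tbm='isch')
    print(base_url + urlencode(params))
# https://www.google.com/search?q=car&ijn=0&start=0&tbm=isch
# https://www.google.com/search?q=car&ijn=1&start=100&tbm=isch
# https://www.google.com/search?q=car&ijn=2&start=200&tbm=isch
```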
I think your problem might be related to #38.
@vogelbam Hi, thanks for your reply. However, I find that the `date_min` argument was removed from the docs after issue #38. What's worse, searching images by date doesn't work anymore either (#78). I have tried searching with different date ranges, but it failed. It seems that the URL param below no longer works:
https://github.com/hellock/icrawler/blob/1acbb9608191de963de9ffd8bf27dff4f5cba3ab/icrawler/builtin/google.py#L114
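For context, that filter string ends up in the `tbs` query parameter using Google's custom-date-range (`cdr`) syntax. A minimal sketch of the kind of URL this produces (the keyword and dates are made up for illustration):

```python
from urllib.parse import urlencode

# Hypothetical example of a date-filtered image search URL; the tbs value
# follows Google's cdr:1,cd_min,cd_max custom-date-range format, which is
# what this comment reports as no longer being honored.
params = dict(
    q='car',
    tbs='cdr:1,cd_min:01/01/2020,cd_max:01/31/2020',
    tbm='isch')
print('https://www.google.com/search?' + urlencode(params))
```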
Same issue for me. It seems that the paging method is not working correctly and only the first page is processed. For example, when crawling car images, the URL of the first page is:

https://www.google.com/search?q=car&ijn=0&start=0&tbm=isch

This page is fine, and the crawler can fetch around 100 images from it. For the next pages the URLs are:

https://www.google.com/search?q=car&ijn=1&start=100&tbm=isch
https://www.google.com/search?q=car&ijn=2&start=200&tbm=isch
...

Parsing these pages does not return any results. I've also checked these pages in my browser, and they all return the same results as the first page.
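A quick way to check this from Python (a diagnostic sketch, assuming the `requests` package; the `data:image` marker depends on Google's current HTML, so treat the counts as a rough signal only):

```python
import requests

# Fetch two "pages" and count embedded base64 thumbnails, which appear
# roughly once per image result. If paging worked, the second URL would
# return a different result set rather than a copy of the first page.
headers = {'User-Agent': 'Mozilla/5.0'}
urls = [
    'https://www.google.com/search?q=car&ijn=0&start=0&tbm=isch',
    'https://www.google.com/search?q=car&ijn=1&start=100&tbm=isch',
]
for url in urls:
    html = requests.get(url, headers=headers).text
    print(url, html.count('data:image'))
```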
I have just looked into it for a bit, and it seems Google now updates the result page through a POST request like this:

https://www.google.com/imgevent?ei=vimhX4KlHOqYr7wPsP6YwAk&iact=ms&forward=1&ct=vfe_scroll&scroll=1400&page=1&start=24&ndsp=4&bih=1830&biw=389
Has this problem been fixed yet? I have the same issue and hope to download more pictures.
It seems that Google's algorithm may cause fewer resources to be crawled than expected. I brute-forced around this problem by iterating over disjoint `date` filter ranges, like:
```python
from icrawler.builtin import GoogleImageCrawler
import datetime

n_total_images = 10000
n_per_crawl = 100

delta = datetime.timedelta(days=30)
end_day = datetime.datetime(2022, 9, 29)

def datetime2tuple(date):
    return (date.year, date.month, date.day)

for i in range(int(n_total_images / n_per_crawl)):
    start_day = end_day - delta
    google_crawler = GoogleImageCrawler(
        downloader_threads=4,
        storage={'root_dir': '/path/to/image'})
    google_crawler.crawl(
        keyword='<YOUR_KEYWORDS>',
        filters={'date': (datetime2tuple(start_day), datetime2tuple(end_day))},
        file_idx_offset=i * n_per_crawl,
        max_num=n_per_crawl)
    # Step back one day past the range start so the date ranges stay disjoint.
    end_day = start_day - datetime.timedelta(days=1)
```
Edit: Note that this method may cause image duplication, so you should post-process the collected images. FYI, I use the imagededup Python library, which is a CNN-based duplicate image detector.
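A minimal sketch of that post-processing step (assuming imagededup is installed; the directory path and similarity threshold are placeholders):

```python
import os
from imagededup.methods import CNN

cnn = CNN()
# find_duplicates_to_remove returns a list of file names that can be
# deleted so that only one copy of each near-duplicate group remains.
to_remove = cnn.find_duplicates_to_remove(
    image_dir='/path/to/image', min_similarity_threshold=0.9)
for fname in to_remove:
    os.remove(os.path.join('/path/to/image', fname))
```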
With this approach you may get 2000 images per keyword without any problem.