Google Crawler can only get around 100 images instead of 1000
Hi, when I use the search URLs generated by the `feed()` function in `GoogleFeeder`, I only get around 100 images even though `max_num=1000`. I find that all of the URLs return the same 100 results as the first URL, so it seems the `ijn` and `start` params no longer have any effect. I just want to get close to 1000 images per keyword. Does anybody have a solution?
```python
def feed(self, keyword, offset, max_num, language=None, filters=None):
    base_url = 'https://www.google.com/search?'
    self.filter = self.get_filter()
    filter_str = self.filter.apply(filters, sep=',')
    for i in range(offset, offset + max_num, 100):
        params = dict(
            q=keyword,
            ijn=int(i / 100),
            start=i,
            tbs=filter_str,
            tbm='isch')
        if language:
            params['lr'] = 'lang_' + language
        url = base_url + urlencode(params)
        self.out_queue.put({'url': url, 'keyword': keyword, 'next_offset': i + 100})
        self.logger.debug('put url to url_queue: {}'.format(url))
```
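For reference, here is a standalone sketch of the URL scheme that loop produces (the keyword `car` and the `offset=0, max_num=300` values are just examples):

```python
from urllib.parse import urlencode

# Reproduce feed()'s pagination outside the class: one URL per batch of
# 100 results, stepping ijn and start together.
base_url = 'https://www.google.com/search?'
for i in range(0, 300, 100):
    params = dict(q='car', ijn=i // 100, start=i, tbm='isch')
    print(base_url + urlencode(params))
# https://www.google.com/search?q=car&ijn=0&start=0&tbm=isch
# https://www.google.com/search?q=car&ijn=1&start=100&tbm=isch
# https://www.google.com/search?q=car&ijn=2&start=200&tbm=isch
```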
I think your problem might be related to #38.
@vogelbam Hi, thanks for your reply. However, I find that the `date_min` argument was removed from the docs after issue #38. What's worse, searching images by date doesn't work anymore either (#78). I have tried searching with different date ranges, but it failed. It seems that the URL param below no longer works:
https://github.com/hellock/icrawler/blob/1acbb9608191de963de9ffd8bf27dff4f5cba3ab/icrawler/builtin/google.py#L114
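For context, that filter string ends up in the `tbs` query parameter using Google's custom-date-range (`cdr`) syntax. A minimal sketch of the kind of URL this produces (the keyword and dates are made up for illustration):

```python
from urllib.parse import urlencode

# Hypothetical example of a date-filtered image search URL; the tbs value
# follows Google's cdr:1,cd_min,cd_max custom-date-range format, which is
# what this comment reports as no longer being honored.
params = dict(
    q='car',
    tbs='cdr:1,cd_min:01/01/2020,cd_max:01/31/2020',
    tbm='isch')
print('https://www.google.com/search?' + urlencode(params))
```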
Same issue for me. It seems that the paging method is not working correctly and only the first page is processed. For example, when crawling car images, the URL of the first page is:

https://www.google.com/search?q=car&ijn=0&start=0&tbm=isch

This page is fine, and the crawler can fetch around 100 images from it. For the next pages the URLs are:

https://www.google.com/search?q=car&ijn=1&start=100&tbm=isch
https://www.google.com/search?q=car&ijn=2&start=200&tbm=isch
...

Parsing these pages does not return any results. I've also checked these pages in my browser, and they all return the same results as the first page.
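A quick way to check this from Python (a diagnostic sketch, assuming the `requests` package; the `data:image` marker depends on Google's current HTML, so treat the counts as a rough signal only):

```python
import requests

# Fetch two "pages" and count embedded base64 thumbnails, which appear
# roughly once per image result. If paging worked, the second URL would
# return a different result set rather than a copy of the first page.
headers = {'User-Agent': 'Mozilla/5.0'}
urls = [
    'https://www.google.com/search?q=car&ijn=0&start=0&tbm=isch',
    'https://www.google.com/search?q=car&ijn=1&start=100&tbm=isch',
]
for url in urls:
    html = requests.get(url, headers=headers).text
    print(url, html.count('data:image'))
```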
I have just looked into it for a bit, and it seems Google now updates the result page through a POST request like this:

https://www.google.com/imgevent?ei=vimhX4KlHOqYr7wPsP6YwAk&iact=ms&forward=1&ct=vfe_scroll&scroll=1400&page=1&start=24&ndsp=4&bih=1830&biw=389
Has this problem been fixed yet? I have the same issue and hope to download more pictures.
It seems that Google's algorithm may cause fewer resources to be crawled than expected. I brute-forced around this problem by iterating over disjoint `date` filter ranges, like:
```python
from icrawler.builtin import GoogleImageCrawler
import datetime

n_total_images = 10000
n_per_crawl = 100

delta = datetime.timedelta(days=30)
end_day = datetime.datetime(2022, 9, 29)

def datetime2tuple(date):
    return (date.year, date.month, date.day)

for i in range(int(n_total_images / n_per_crawl)):
    start_day = end_day - delta
    google_crawler = GoogleImageCrawler(
        downloader_threads=4,
        storage={'root_dir': '/path/to/image'})
    google_crawler.crawl(
        keyword='<YOUR_KEYWORDS>',
        filters={'date': (datetime2tuple(start_day), datetime2tuple(end_day))},
        file_idx_offset=i * n_per_crawl,
        max_num=n_per_crawl)
    # Step back one day past the range start so the date ranges stay disjoint.
    end_day = start_day - datetime.timedelta(days=1)
```
Edit: Note that this method may cause image duplication, so you should post-process the collected images. FYI, I use the imagededup Python library, which is a CNN-based duplicate image detector.
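A minimal sketch of that post-processing step (assuming imagededup is installed; the directory path and similarity threshold are placeholders):

```python
import os
from imagededup.methods import CNN

cnn = CNN()
# find_duplicates_to_remove returns a list of file names that can be
# deleted so that only one copy of each near-duplicate group remains.
to_remove = cnn.find_duplicates_to_remove(
    image_dir='/path/to/image', min_similarity_threshold=0.9)
for fname in to_remove:
    os.remove(os.path.join('/path/to/image', fname))
```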
With this approach you may get 2000 images per keyword without any problem.