
Google Crawler can only get around 100 images instead of 1000

jianjieluo opened this issue 4 years ago • 7 comments

Hi, when I use the search URLs generated by the feed() function in GoogleFeeder, I can only get around 100 images even though max_num=1000. I found that all the URLs return the same 100 results as the first URL. It seems that the ijn and start params no longer have any effect. I just want to get nearly 1000 images per keyword. Does anybody have a solution?

def feed(self, keyword, offset, max_num, language=None, filters=None):
    base_url = 'https://www.google.com/search?'
    self.filter = self.get_filter()
    filter_str = self.filter.apply(filters, sep=',')
    for i in range(offset, offset + max_num, 100):
        params = dict(
            q=keyword,
            ijn=i // 100,
            start=i,
            tbs=filter_str,
            tbm='isch')
        if language:
            params['lr'] = 'lang_' + language
        url = base_url + urlencode(params)
        self.out_queue.put({'url': url, 'keyword': keyword, 'next_offset': i + 100})
        self.logger.debug('put url to url_queue: {}'.format(url))
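
The URL construction above can be reproduced in isolation to inspect exactly what feed() enqueues. This is a minimal sketch (the hypothetical helper `google_search_urls` is mine, and it omits the tbs filter string and the queue/logger plumbing of the real feeder):

```python
from urllib.parse import urlencode

def google_search_urls(keyword, offset, max_num, page_size=100):
    """Rebuild the paginated search URLs that GoogleFeeder.feed() would enqueue."""
    base_url = 'https://www.google.com/search?'
    urls = []
    for i in range(offset, offset + max_num, page_size):
        # ijn is the zero-based page index, start is the result offset.
        params = dict(q=keyword, ijn=i // page_size, start=i, tbm='isch')
        urls.append(base_url + urlencode(params))
    return urls

for url in google_search_urls('car', 0, 300):
    print(url)
```

Fetching these in a browser makes the bug easy to confirm: every page beyond the first returns the same results despite the differing ijn/start values.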

jianjieluo avatar Aug 16 '20 14:08 jianjieluo

I think your problem might be related to #38 .

vogelbam avatar Aug 17 '20 17:08 vogelbam

@vogelbam hi, thanks for your reply. However, I found that the date_min argument was removed from the docs after issue #38. What's worse, searching images by date doesn't work any more (#78). I tried searching with different date ranges but it failed. It seems that the URL param below no longer works.

https://github.com/hellock/icrawler/blob/1acbb9608191de963de9ffd8bf27dff4f5cba3ab/icrawler/builtin/google.py#L114

jianjieluo avatar Aug 17 '20 18:08 jianjieluo

Same issue for me. It seems that the paging is not working correctly and only the first page is processed. For example, to crawl car images, the URL of the first page is:

https://www.google.com/search?q=car&ijn=0&start=0&tbm=isch

This page is fine and the crawler can fetch around 100 images. For the next pages the URLs are:

https://www.google.com/search?q=car&ijn=1&start=100&tbm=isch
https://www.google.com/search?q=car&ijn=2&start=200&tbm=isch
...

Parsing these pages does not return any results. I've also checked these pages in my browser, and they all return the same results as the first page.

r-y-zadeh avatar Aug 20 '20 01:08 r-y-zadeh

I have just looked into it for a bit, and it seems Google now updates the result page through a POST request like this:

https://www.google.com/imgevent?ei=vimhX4KlHOqYr7wPsP6YwAk&iact=ms&forward=1&ct=vfe_scroll&scroll=1400&page=1&start=24&ndsp=4&bih=1830&biw=389

ZhiyuanChen avatar Nov 03 '20 10:11 ZhiyuanChen

Is this problem fixed now? I have the same issue and hope to download more pictures.

ManiaaJia avatar Jan 25 '22 10:01 ManiaaJia

It seems that Google's algorithm may cause the crawler to fetch fewer images than expected. I brute-forced around this by iterating over disjoint date ranges, like so:

from icrawler.builtin import GoogleImageCrawler
import datetime

n_total_images = 10000
n_per_crawl = 100

delta = datetime.timedelta(days=30)
end_day = datetime.datetime(2022, 9, 29)

def datetime2tuple(date):
    return (date.year, date.month, date.day)

for i in range(n_total_images // n_per_crawl):
    start_day = end_day - delta
    google_crawler = GoogleImageCrawler(downloader_threads=4,
                                        storage={'root_dir': '/path/to/image'})
    google_crawler.crawl(keyword='<YOUR_KEYWORDS>',
                         filters={'date': (datetime2tuple(start_day), datetime2tuple(end_day))},
                         file_idx_offset=i * n_per_crawl,
                         max_num=n_per_crawl)
    # Step back so the next window ends the day before this one starts.
    end_day = start_day - datetime.timedelta(days=1)

Edit: Note that this method may produce duplicate images, so you should post-process the collected set. FYI, I use the imagededup Python library, which includes a CNN-based duplicate-image detector.
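
The disjoint-window idea in the loop above can be factored into a small stdlib-only helper (the name `date_windows` is hypothetical, not part of icrawler); each yielded pair could be fed to the `filters={'date': ...}` argument as shown:

```python
import datetime

def date_windows(end_day, n_windows, days_per_window=30):
    """Yield disjoint (start_day, end_day) date pairs, walking backwards in time."""
    for _ in range(n_windows):
        start_day = end_day - datetime.timedelta(days=days_per_window)
        yield (start_day, end_day)
        # The next window ends the day before this one starts, so ranges never overlap.
        end_day = start_day - datetime.timedelta(days=1)
```

Because the windows are disjoint by construction, any remaining duplicates come from Google returning the same image for different date ranges, which is why a dedup pass is still needed afterwards.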

somisawa avatar Sep 29 '22 05:09 somisawa

With this method you may get 2000 images perfectly.

hasnatsakil avatar Jun 18 '23 14:06 hasnatsakil