
zero results (again)

Open teracow opened this issue 5 years ago • 8 comments

Yes, Google have updated their page-code again, so some new regexes are needed to scrape the links.

Working on it now ...

teracow avatar Feb 09 '20 08:02 teracow

Threw a quick scraper together that seems to work (haven't pushed it up to here yet).

But it's only finding a maximum of 104 unique images across 10 pages. Hmm ... have to keep looking. Unfortunately, I'm out of time now, so I'll keep looking tomorrow.

Google have certainly advanced their page-code. It gets harder each time to extract the original image URLs. :laughing:

teracow avatar Feb 09 '20 08:02 teracow

OK we're out of action for now. I'll need to decode the endless-page scripting in order to request more than a single page of image results.

I'm not in a coding-cycle at the moment, and I'm unable to say when I'll be able to get around to this. Hopefully, it'll be the next time I have a few days free. :disappointed:

teracow avatar Feb 10 '20 00:02 teracow

If anyone would like to have a shot at fixing this, you're more than welcome. :grin:

The current issue is: I can scrape the new results page, but can't trigger the endless-page scrolling. So, if I separately request 10 pages of results, I actually get the first page 10 times (with the same 100-or-so results listed on that first page).
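One way to confirm this symptom is to count how many previously unseen URLs each requested page contributes. This is a minimal Python sketch, assuming each page has already been scraped into a list of image URLs. The function name and the example.com URLs are hypothetical and are not part of googliser:

```python
def new_urls_per_page(pages):
    """For each page (a list of URLs), count the URLs not seen on earlier pages."""
    seen = set()
    counts = []
    for urls in pages:
        fresh = set(urls) - seen      # URLs this page adds to the running total
        counts.append(len(fresh))
        seen |= fresh
    return counts


# Simulate the behaviour described above: every request returns the first page.
first_page = [f"https://example.com/img{i}.jpg" for i in range(100)]
pages = [first_page] * 10

counts = new_urls_per_page(pages)
print(counts)  # the first page adds 100 URLs, each later page adds 0
```

If every count after the first is zero, the extra requests are only returning the first page again.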

teracow avatar Feb 11 '20 06:02 teracow

I've pushed the new scraper to GitHub, so at least results from the first page can be found.

Now need to work out how to request the rest of the results pages (again).

teracow avatar Feb 11 '20 21:02 teracow

Your scraper is the fastest I found, thanks! Compared to iCrawler and google-images-download, which are also struggling with the Google code change, you have at least made it work for one page (approx. 40 images)!

What I suggest as a temporary workaround is to add the parameters below to your parameter list, like this:

--adjusted-period-min [PRESET] 
--adjusted-period-max [PRESET] 

The idea is that this should allow downloading images for multiple specified periods, and thus requesting multiple pages for each class. If I do this 10 times for each class I will have 400 images per class, which is currently enough for me. Do it 20 times and you'll have your 800 again.

Unfortunately I don't have the skills to implement the above suggestion, otherwise I would have contributed more instead of only suggesting what to do :) . I hope the idea helps to solve the issue soon though.

LeaTaka avatar Feb 19 '20 09:02 LeaTaka

> The idea is that this should allow downloading images for multiple specified periods, and thus requesting multiple pages for each class. If I do this 10 times for each class I will have 400 images per class, which is currently enough for me. Do it 20 times and you'll have your 800 again.

That's an interesting idea. :nerd_face:

But I'm not sure what you mean by specified periods. Do you mean the Google search parameter called 'time'?

teracow avatar Feb 19 '20 19:02 teracow

No, I didn't mean the time parameter; you already offer this, I guess. I was hoping there would be a custom-period function similar to the one in the text search, but unfortunately there isn't.

However, the workaround can be quite simple. Just add a year (e.g. 2011) to the search phrase and, with a bit of luck, the page only returns images for your search phrase from that year. This needs a bit more testing, but first checks seem promising.
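The year-suffix idea could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: `year_queries`, `collect_unique` and the `fetch` callable are hypothetical names (not googliser options), and `fake_fetch` stands in for a real single-page scrape so the example runs without network access.

```python
def year_queries(phrase, first_year, last_year):
    """Build one search phrase per year, e.g. 'red apple 2011'."""
    return [f"{phrase} {year}" for year in range(first_year, last_year + 1)]


def collect_unique(phrase, first_year, last_year, fetch):
    """Run one first-page search per year and merge the URLs without duplicates.

    `fetch` is any callable that takes a search phrase and returns a list of
    image URLs (hypothetical stand-in for a single-page scrape).
    """
    seen = set()
    merged = []
    for query in year_queries(phrase, first_year, last_year):
        for url in fetch(query):
            if url not in seen:       # different years may return the same image
                seen.add(url)
                merged.append(url)
    return merged


def fake_fetch(query):
    """Stand-in scraper: three URLs per query, one of them shared by all queries."""
    year = query.rsplit(" ", 1)[-1]
    return [
        "https://example.com/common.jpg",
        f"https://example.com/{year}-a.jpg",
        f"https://example.com/{year}-b.jpg",
    ]


urls = collect_unique("red apple", 2011, 2013, fake_fetch)
print(len(urls))  # 7: one shared image plus two per year for three years
```

How many genuinely new images each year adds depends on how much the per-year results overlap, so the totals would need checking against real searches.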

Another interesting Google setting is the Search settings option under Settings. There you can specify the number of search results per page. Maybe this setting can help to get more than 40 images per run.

Cheers!

LeaTaka avatar Feb 20 '20 23:02 LeaTaka

Okiedoke, some good thoughts there.

I'll see if I can spend some time on it this weekend.

teracow avatar Feb 20 '20 23:02 teracow