
cssselect fails on Googlerbot

Open ahoskins opened this issue 10 years ago • 6 comments

Hi, cool library - really fun stuff.

I'm trying to get the Googlerbot to work, but having trouble with:

href = root.cssselect(".bia")[0].attrib["href"]

The root object does not contain any DOM nodes with the class "bia". Does this have to do with Google's HTML being loaded asynchronously? When I write the root object to a file, some of the HTML from the response is there, but not all of it.
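One way to check whether the fetched HTML contains any "bia" elements at all, before indexing into the result, is sketched below. It uses only the standard library rather than cssselect, and the names ClassHrefCollector and find_hrefs are made up for illustration. It returns an empty list instead of raising IndexError when nothing matches.

```python
from html.parser import HTMLParser


class ClassHrefCollector(HTMLParser):
    """Collect href attributes from tags that carry a given CSS class.

    Hypothetical helper for diagnosis; not part of the library.
    """

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        # The class attribute may hold several space-separated names.
        classes = (attr_map.get("class") or "").split()
        if self.css_class in classes and attr_map.get("href"):
            self.hrefs.append(attr_map["href"])


def find_hrefs(html_text, css_class):
    """Return every href found on elements with css_class (may be empty)."""
    collector = ClassHrefCollector(css_class)
    collector.feed(html_text)
    return collector.hrefs
```

If find_hrefs(page_html, "bia") comes back empty, the class is simply not in the response, which points at the fetched markup rather than at the selector.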

Any idea how I can fix this problem?

ahoskins avatar Feb 22 '15 04:02 ahoskins

I've also looked into this. I tried fetching the page with urlopen and with requests, with and without redirects, and can't seem to get HTML that looks like what you would get from a browser. Perhaps fiddling with the User-Agent would fix the problem?
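For anyone who wants to try the User-Agent idea, a minimal sketch with the standard library follows. The helper name build_request and the example URL and User-Agent string are placeholders, not values taken from the bot, and there is no guarantee this changes what Google returns.

```python
from urllib.request import Request


def build_request(url, user_agent):
    """Build a request that sends a custom User-Agent header.

    Hypothetical helper; pass the result to urllib.request.urlopen to fetch.
    """
    return Request(url, headers={"User-Agent": user_agent})


# Placeholder values for illustration only.
req = build_request(
    "https://example.com/search?q=cats",
    "Mozilla/5.0 (X11; Linux x86_64)",
)
```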

mossbanay avatar Feb 23 '15 08:02 mossbanay

User-Agent doesn't make a difference for me. I'm fairly sure Google results are lazily loaded, so parsing the HTML synchronously won't work. I ended up using the (deprecated) Google Image Search API - it just returns JSON, so it's much easier to work with.
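Roughly, the JSON route looks like the sketch below. The response shape (a "responseData" object holding a "results" list whose items have an "unescapedUrl" field) is an assumption made for illustration, as is the function name extract_image_urls; check it against a real response before relying on it.

```python
import json


def extract_image_urls(payload_text, url_key="unescapedUrl"):
    """Pull image URLs out of a JSON search response.

    The key names used here are assumed, not confirmed in this thread.
    """
    payload = json.loads(payload_text)
    # Tolerate a missing or null "responseData" / "results" entry.
    results = (payload.get("responseData") or {}).get("results") or []
    return [item[url_key] for item in results if url_key in item]
```

Because missing keys yield an empty list, the caller can treat "no results" and "unexpected response" the same way instead of hitting an IndexError.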

ahoskins avatar Feb 25 '15 05:02 ahoskins

Could you close the issue and/or make a commit with your solution? Thanks :smile:

N07070 avatar Mar 04 '15 23:03 N07070

Sorry, I forgot about this. I do have a solution that works and will pull request as soon as I can get around to it.

ahoskins avatar Apr 07 '15 18:04 ahoskins

Awesome! Will this be invite only or open to public github? Any idea on a time frame?

mavieth avatar Apr 07 '15 18:04 mavieth

It will be by the end of the week! I have my own public project using this library. It is a working extension of the google-bot - feel free to check it out. I'll clean it up, get rid of some of the extra stuff, and send a pull request to this project later this week.

ahoskins avatar Apr 08 '15 00:04 ahoskins