
Support for TOR browser

Open frandres opened this issue 10 years ago • 5 comments

Using Tor Browser (https://www.torproject.org/projects/torbrowser.html.en) is a nice way to run queries from different IPs without having to set up proxies. Since it is based on Firefox, it's relatively easy to set up with GoogleScraper. You just have to load Firefox using a profile with Tor's configuration, which means changing this line: self.webdriver = webdriver.Firefox()

in the _get_Firefox method in the selenium_mode.py file to

            profile = webdriver.FirefoxProfile()
            profile.set_preference('network.proxy.type', 1)
            profile.set_preference('network.proxy.socks', '127.0.0.1')
            profile.set_preference('network.proxy.socks_port', 9150)
            self.webdriver = webdriver.Firefox(profile)
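
The same preferences can be kept in one place and applied to any object that exposes `set_preference` (a `FirefoxProfile`, or the `Options` object in newer Selenium releases). This is only a minimal sketch: the helper names `tor_proxy_preferences` and `apply_preferences` are made up for illustration, and `network.proxy.socks_remote_dns` is an extra Firefox preference that is not in the snippet above; it is assumed here so that DNS lookups also go through the SOCKS proxy.

```python
def tor_proxy_preferences(host="127.0.0.1", socks_port=9150):
    """Return Firefox preferences that route traffic through a SOCKS proxy.

    9150 is the SOCKS port used in the snippet above.
    """
    return {
        "network.proxy.type": 1,  # 1 = manual proxy configuration
        "network.proxy.socks": host,
        "network.proxy.socks_port": socks_port,
        # Assumption: not part of the original snippet; resolves DNS via the proxy.
        "network.proxy.socks_remote_dns": True,
    }


def apply_preferences(target, preferences):
    """Call target.set_preference(name, value) for every entry (hypothetical helper)."""
    for name, value in preferences.items():
        target.set_preference(name, value)
    return target
```

With Selenium installed, this would be used roughly as `profile = apply_preferences(webdriver.FirefoxProfile(), tor_proxy_preferences())` before creating the driver.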

The nice thing is that you can request a new IP whenever you want by running:

            from stem import Signal
            from stem.control import Controller

            with Controller.from_port(port = 9151) as controller:
                controller.authenticate()
                controller.signal(Signal.NEWNYM)
                controller.signal(Signal.HUP)

This means that whenever your script is caught as a robot, you can request a new IP, load a new instance of Firefox and resume your scraping. Empirically this seems to work; the search engine catches you sometimes, but if you keep trying it seems to eventually get an IP that is not detected.
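
The "keep trying until an IP gets through" loop could be sketched as below. This is an illustration under assumptions, not code from GoogleScraper: `fetch`, `looks_blocked` and `renew_identity` are hypothetical callables that the caller supplies (for example, `renew_identity` could wrap a `Signal.NEWNYM` request and `fetch` could start a fresh Firefox instance).

```python
def scrape_with_new_identities(fetch, looks_blocked, renew_identity, max_attempts=5):
    """Call fetch() until its result is not flagged as blocked.

    fetch          -- hypothetical callable that loads the page in a fresh browser
    looks_blocked  -- hypothetical callable that detects a robot check in the page
    renew_identity -- hypothetical callable that asks Tor for a new identity
    Returns (page, attempts_used); raises RuntimeError if every attempt is blocked.
    """
    for attempt in range(1, max_attempts + 1):
        page = fetch()
        if not looks_blocked(page):
            return page, attempt
        if attempt < max_attempts:
            # Only ask for a new identity when another attempt will follow.
            renew_identity()
    raise RuntimeError("still blocked after %d attempts" % max_attempts)
```

Passing the three steps in as callables keeps the loop independent of Selenium and stem, so it can be tried out with stand-in functions first.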

This might be a nice feature for future development. I can send my selenium_mode.py version to whoever wants to try this.

frandres avatar May 18 '15 12:05 frandres

Will look into this! Huge thanks :)

NikolaiT avatar May 20 '15 11:05 NikolaiT

I'm working on a solution for this. Chances are stem use will be limited. It's much easier to spawn a ton of Tor instances with different ports and connect through those. Stem can then be used to get a new IP if the number of instances is not great enough, as well as to sort relays by connection speed so that we may prefer the fastest ones.
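
A rough sketch of the "many Tor instances on different ports" idea: each instance needs its own SOCKS port, control port and data directory, which can be passed to `tor` as command-line options. The function name, the base port 9060 and the `/tmp/tor-instances` directory are arbitrary assumptions for illustration; the function only builds the command lines and does not start anything.

```python
import os


def tor_instance_commands(count, base_port=9060, data_root="/tmp/tor-instances"):
    """Build one `tor` command line per instance, each with its own ports.

    Instance i gets SOCKS port base_port + 2*i and control port base_port + 2*i + 1.
    The port numbers and data_root are illustrative assumptions.
    """
    commands = []
    for i in range(count):
        socks_port = base_port + 2 * i
        control_port = socks_port + 1
        data_dir = os.path.join(data_root, "tor%d" % i)  # separate state per instance
        commands.append([
            "tor",
            "--SocksPort", str(socks_port),
            "--ControlPort", str(control_port),
            "--DataDirectory", data_dir,
        ])
    return commands
```

Each returned list could then be handed to `subprocess.Popen`, and the browser profile pointed at the matching SOCKS port.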

neuegram avatar Jul 14 '15 19:07 neuegram

Both methods work pretty well: run a bunch of Tor instances, rotate IPs periodically if Google bans them temporarily, and front the whole bunch with haproxy -> privoxy.
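
For anyone who wants to try the rotation without a haproxy -> privoxy front end, the idea can be approximated in plain Python. This is a swapped-in illustration, not an HAProxy configuration: the `ProxyRotator` class, its method names and the 600-second cooldown are all assumptions.

```python
import itertools
import time


class ProxyRotator:
    """Round-robin over local proxy ports, skipping ports that were recently banned."""

    def __init__(self, ports, cooldown_seconds=600, clock=time.monotonic):
        self._ports = list(ports)
        self._cycle = itertools.cycle(self._ports)
        self._banned_until = {}
        self._cooldown = cooldown_seconds
        self._clock = clock  # injectable so the cooldown can be tested without waiting

    def mark_banned(self, port):
        """Take a port out of rotation for cooldown_seconds."""
        self._banned_until[port] = self._clock() + self._cooldown

    def next_port(self):
        """Return the next port whose cooldown has expired."""
        now = self._clock()
        for _ in range(len(self._ports)):
            port = next(self._cycle)
            if self._banned_until.get(port, float("-inf")) <= now:
                return port
        raise RuntimeError("all proxies are cooling down")
```

The caller would call `mark_banned(port)` when a robot check appears and `next_port()` before each new browser session.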

matthewford avatar Jul 23 '15 01:07 matthewford

Frandres, it would be great if you could send your selenium_mode.py version with the Tor settings, along with any other files needed; I would like to try it. Thanks in advance.

yh18190 avatar Nov 16 '15 16:11 yh18190

Could you send me the new selenium_mode.py for TOR?

alon001 avatar May 29 '17 06:05 alon001