node-horseman

So many 'failed to GET url'

Open minotaurrr opened this issue 7 years ago • 6 comments

I'm just doing horseman.open('https://www.google.com') for testing, but I keep getting 'failed to GET url' at random times. It fails roughly 7 out of 10 times.

Any idea why?

minotaurrr avatar Nov 15 '17 15:11 minotaurrr

Kicked the tires for this library following the docs for the project and saw a similar thing. Both Twitter and Google examples failed to run.

horseman v3.3.0, node v8.9.1

nelsonwittwer avatar Nov 17 '17 22:11 nelsonwittwer

Tried on multiple hosts and noticed that the failure frequency varies, but I still get the same error eventually.

minotaurrr avatar Nov 17 '17 22:11 minotaurrr

Bumping this topic; the same thing is happening to me.

grohsfabian avatar Nov 20 '17 08:11 grohsfabian

Bumping this. I'm getting the error repeatedly, and I can't catch it either.

NoelDavies avatar Nov 24 '17 15:11 NoelDavies

minotaurrr, Google detects scrapers and bans your IP address very quickly. That means you can only call horseman.open('http://google.com') ONCE every 5 minutes. If you want to scrape it more than once per 5 minutes, you need to:

  • set up a proxy in the horseman options
  • clear cookies with horseman.cookies()
  • change the User-Agent in horseman
  • also vary the value you pass to horseman.wait(value). If you always have the same time interval between your requests, Google will flag it.

t0ursene avatar Jan 04 '18 15:01 t0ursene
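
The User-Agent and wait-time suggestions above could be sketched roughly as follows. This is only a minimal illustration, not something taken from the horseman docs: the helper names (pickRandom, jitteredDelay, nextRequestSettings), the User-Agent strings, and the 5 to 15 second range are all assumptions chosen for the example. The commented-out usage at the bottom assumes node-horseman is installed and uses a placeholder proxy address.

```javascript
// Hypothetical helpers; none of these names come from node-horseman itself.

// Example User-Agent strings (assumed values, swap in your own).
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/604.3.5 (KHTML, like Gecko) Version/11.0.1 Safari/604.3.5',
  'Mozilla/5.0 (X11; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0',
];

// Pick one element of a non-empty array uniformly at random.
function pickRandom(items, random = Math.random) {
  return items[Math.floor(random() * items.length)];
}

// Return a whole number of milliseconds drawn from [minMs, maxMs).
function jitteredDelay(minMs, maxMs, random = Math.random) {
  return Math.floor(minMs + random() * (maxMs - minMs));
}

// Build fresh settings for each request so that neither the User-Agent
// nor the pause between requests stays constant.
function nextRequestSettings(random = Math.random) {
  return {
    userAgent: pickRandom(USER_AGENTS, random),
    waitMs: jitteredDelay(5000, 15000, random), // assumed range: 5 to 15 s
  };
}

// Possible usage (assumes node-horseman is installed; proxy is a placeholder):
//
// const Horseman = require('node-horseman');
// const { userAgent, waitMs } = nextRequestSettings();
// new Horseman({ proxy: '127.0.0.1:8080' })
//   .userAgent(userAgent)
//   .open('https://www.google.com')
//   .wait(waitMs)
//   .close();
```

The random source is passed in as a parameter only so the helpers can be exercised deterministically; in normal use the default Math.random is enough.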

Google must have banned your IP. Set a time interval between GET requests, OR set up a list of proxies and cycle through them randomly.

jorgerosal avatar Jan 24 '18 18:01 jorgerosal
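
Cycling randomly through a proxy list, as suggested above, might look something like this. It is a sketch under stated assumptions: createProxyPicker is a hypothetical helper, and the addresses are documentation-only placeholders (192.0.2.x) that must be replaced with real proxies. The commented-out usage assumes node-horseman is installed and creates one Horseman instance per request, since the proxy is passed in the constructor options.

```javascript
// Placeholder proxies from the reserved documentation range; replace with real ones.
const PROXIES = ['192.0.2.10:8080', '192.0.2.11:8080', '192.0.2.12:8080'];

// Hypothetical helper: returns a function that yields a random proxy on each
// call and, when the list has more than one entry, never returns the same
// proxy twice in a row.
function createProxyPicker(proxies, random = Math.random) {
  if (proxies.length === 0) {
    throw new Error('proxy list is empty');
  }
  let last = null;
  return function nextProxy() {
    // Exclude the previously used proxy unless it is the only one available.
    const candidates =
      proxies.length > 1 ? proxies.filter((p) => p !== last) : proxies;
    last = candidates[Math.floor(random() * candidates.length)];
    return last;
  };
}

// Possible usage (assumes node-horseman is installed):
//
// const Horseman = require('node-horseman');
// const nextProxy = createProxyPicker(PROXIES);
// new Horseman({ proxy: nextProxy(), proxyType: 'http' })
//   .open('https://www.google.com')
//   .close();
```

Avoiding an immediate repeat is a design choice for this example rather than a requirement; plain uniform random selection would also match the advice.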