
How to prevent Splash from sending its default headers, e.g. 'Host'?

Open iamumairayub opened this issue 4 years ago • 0 comments

I deployed Splash (in Docker) on my dedicated server about a month ago.

I am trying to scrape a website with Scrapy Splash, but I get the following error no matter how many times I try that URL:

([scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.website.com via http://localhost:8050/render.html> (failed 1 times): User timeout caused connection failure: Getting http://localhost:8050/render.html took longer than 80.0 seconds..)

Meanwhile, the same Splash server successfully scrapes every other site I try.

If I request the above URL from my server with cURL or scrapy.Request, it works; the site does not block me no matter how many times I scrape it that way.

Then I had the idea to check which headers Splash sends. I inspected the Splash request headers via http://httpbin.org/get and found that it automatically adds a few headers.
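Roughly, the httpbin check can be sketched like this. It is only a sketch under a few assumptions: Splash is reachable at http://localhost:8050 (as in the log above), render.html returns httpbin's JSON wrapped in HTML, and the helper names (`build_render_url`, `extract_headers`, `fetch_headers_seen_by_target`) are made up for illustration.

```python
import html
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed Splash endpoint, taken from the log line above.
SPLASH_RENDER = "http://localhost:8050/render.html"


def build_render_url(target_url, splash_render=SPLASH_RENDER):
    """Return the Splash render.html URL that fetches target_url."""
    return splash_render + "?" + urlencode({"url": target_url})


def extract_headers(rendered_body):
    """Pull the "headers" object out of httpbin's JSON echo.

    Splash returns the rendered page as HTML, so the JSON is cut out
    between the first "{" and the last "}" after unescaping entities.
    """
    text = html.unescape(rendered_body)
    start = text.index("{")
    end = text.rindex("}") + 1
    return json.loads(text[start:end])["headers"]


def fetch_headers_seen_by_target(echo_url="http://httpbin.org/get"):
    """Ask Splash to render httpbin and return the headers httpbin received."""
    with urlopen(build_render_url(echo_url), timeout=90) as response:
        body = response.read().decode("utf-8")
    return extract_headers(body)
```

Calling `fetch_headers_seen_by_target()` against a running Splash instance returns the headers that the target site actually receives, which can then be compared with what cURL or scrapy.Request sends.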

So now I know that Splash is sending "Host": "website.com" to the target site, which prevents that website from being scraped.

The question is: how do I make Splash not send any headers automatically? Or at least stop Splash from sending the Host header?

iamumairayub, Feb 03 '20 14:02