scrapy-splash
How to prevent Splash sending its default headers i.e. 'Host'?
I deployed Splash (in Docker) about a month ago on my dedicated server.
I am trying to scrape a website with Scrapy and Splash, but I get the following error no matter how many times I try that URL:
([scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.website.com via http://localhost:8050/render.html> (failed 1 times): User timeout caused connection failure: Getting http://localhost:8050/render.html took longer than 80.0 seconds..)
Meanwhile, the same Splash server successfully scrapes every other site I try.
If I fetch the above URL from my server with cURL or scrapy.Request, it works; the site does not block me no matter how many times I scrape it that way.
Then I had the idea to check which headers Splash is sending. I debugged the Splash request headers via http://httpbin.org/get and found that it automatically adds a few headers.
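For anyone who wants to repeat the header check, here is a rough sketch of how it can be done with only the Python standard library. It assumes Splash is listening on localhost:8050 as in the log above. The helper names (`splash_debug_url`, `headers_seen_by_target`, `auto_added`) are made up for illustration and are not part of Splash or scrapy-splash. It also assumes the JSON that httpbin returns can be found between the first `{` and the last `}` of the body Splash hands back, which may not hold for every rendering.

```python
import json
from urllib.parse import urlencode

# Splash endpoint taken from the retry log; adjust if your setup differs.
SPLASH_RENDER = "http://localhost:8050/render.html"


def splash_debug_url(target="http://httpbin.org/get", wait=0.5):
    """Build a render.html URL that asks Splash to fetch `target`."""
    return SPLASH_RENDER + "?" + urlencode({"url": target, "wait": wait})


def headers_seen_by_target(body):
    """Pull the "headers" object out of an httpbin /get response body.

    Assumption: the JSON sits between the first "{" and the last "}",
    whether or not Splash wrapped it in HTML.
    """
    start, end = body.find("{"), body.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in response body")
    return json.loads(body[start:end + 1])["headers"]


def auto_added(seen, sent_explicitly=()):
    """Return the headers the target saw that we did not set ourselves."""
    explicit = {name.lower() for name in sent_explicitly}
    return {k: v for k, v in seen.items() if k.lower() not in explicit}
```

Fetching `splash_debug_url()` with cURL, then passing the body to `headers_seen_by_target` and the result to `auto_added` along with the header names you set yourself, leaves only the headers that were added somewhere along the way.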
So now I know that Splash is sending "Host": "website.com" to the target site, which causes the scrape of that website to fail.
The question is: how do I make Splash not send any headers automatically? Or at least stop Splash from sending the Host header?