scrapy-splash
How to get redirect URLs with scrapy-splash
I don't know how to get the redirected URL with scrapy-splash; can you help me? For example, http://xxx.xxx.xxx/1.php redirects to http://xxx.xxx.xxx/index.php. How can I get http://xxx.xxx.xxx/index.php with scrapy-splash? Below is my code, which gets http://xxx.xxx.xxx/1.php instead of http://xxx.xxx.xxx/index.php:
```python
def parse_get(self, response):
    item = CrawlerItem()
    item['code'] = response.status
    item['current_url'] = response.url
    # below prints http://xxx.xxx.xxx/1.php
    print(response.url)

self.lua_script = """
function main(splash, args)
    assert(splash:go{splash.args.url,
                     http_method=splash.args.http_method,
                     body=splash.args.body,
                     headers={['Cookie']='%s'}})
    assert(splash:wait(0.5))
    splash:on_request(function(request)
        request:set_proxy{host = "%s", port = %d}
    end)
    return {cookies = splash:get_cookies(), html = splash:html()}
end
""" % (self.cookie, a[0], a[1])
url = 'http://xxx.xxx.xxx/1.php'
SplashRequest(url, self.parse_get, endpoint='execute', magic_response=True,
              meta={'handle_httpstatus_all': True},
              args={'lua_source': self.lua_script})
```
@3xp10it splash handles redirects by itself, so the result you are getting is from the page it was redirected to. To get its URL, you can add `url = splash:url()` to the return values (see the example in the README under "Use a Lua script to get an HTML response with cookies, headers, body and method set to correct values"); after that, `response.url` should be the URL of the redirected page.
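That suggestion can be sketched as follows: a minimal Lua script (held as a Python string, as in the code above) that returns `splash:url()`, plus a small helper showing how a callback could prefer the URL reported by the script. The helper `final_url` and the sample dict are illustrative assumptions, not part of scrapy-splash:

```python
# A sketch, assuming the setup from this thread: the Lua script returns the
# post-redirect URL via splash:url(), and the spider callback reads it from
# response.data (the dict an /execute endpoint script returns).
LUA_SCRIPT = """
function main(splash, args)
    assert(splash:go(args.url))
    assert(splash:wait(0.5))
    return {url = splash:url(), html = splash:html()}
end
"""

def final_url(data, fallback):
    # Prefer the URL the browser ended up on (after redirects);
    # fall back to the request URL if the script returned none.
    return data.get('url') or fallback

# Illustrative values only -- in a spider, `data` would be response.data.
data = {'url': 'http://example.com/index.php', 'html': '<html></html>'}
print(final_url(data, 'http://example.com/1.php'))  # http://example.com/index.php
```

In a real spider the callback would store `final_url(response.data, response.url)` on the item instead of `response.url`.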
@lopuhin
In my code, http://192.168.93.139/dvwa/vulnerabilities/xss_r/?name=?name=?name=?name=?name= redirects to http://192.168.93.139/dvwa/vulnerabilities/xss_r/index.php. I tried adding `url = splash:url()`, but it still fails:
```python
self.lua_script = """
function main(splash, args)
    assert(splash:go{splash.args.url,
                     http_method=splash.args.http_method,
                     body=splash.args.body,
                     headers={['Cookie']='%s'}})
    assert(splash:wait(0.5))
    splash:on_request(function(request)
        request:set_proxy{host = "%s", port = %d}
    end)
    return {url = splash:url(), cookies = splash:get_cookies(), html = splash:html()}
end
""" % (self.cookie, a[0], a[1])
```
```python
def parse_get(self, response):
    input(44444444444444)
    item = CrawlerItem()
    item['code'] = response.status
    item['current_url'] = response.url
    print(response.url)
    input(3333333333)
    if response.url == "http://192.168.93.139/dvwa/vulnerabilities/xss_r/?name=?name=?name=?name=?name=":
        print('fail ....................')
    if response.url == "http://192.168.93.139/dvwa/vulnerabilities/xss_r/index.php":
        print('succeed .................')
```
Below is the result:

```
2222222222222
2017-11-29 17:26:26 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://192.168.93.139/robots.txt> (referer: None)
2017-11-29 17:26:26 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://192.168.89.190:8050/robots.txt> (referer: None)
2017-11-29 17:26:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://192.168.93.139/dvwa/vulnerabilities/xss_r/?name=?name=?name=?name=?name= via http://192.168.89.190:8050/execute> (referer: None)
44444444444444
http://192.168.93.139/dvwa/vulnerabilities/xss_r/?name=?name=?name=?name=?name=
3333333333
fail ....................
```
@3xp10it I see, that's not what I expected... Just to be sure: you are not turning off `magic_response` anywhere, and `scrapy_splash.SplashMiddleware` is used, right? Also, maybe you could try crawling https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F to check whether it works for another domain.
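For reference, the percent-encoding in that test URL can be reproduced with the standard library, which makes it easy to build similar redirect tests for other targets (the target URL here is just an example):

```python
from urllib.parse import urlencode

# httpbin's /redirect-to endpoint issues a 302 to whatever `url` says,
# so the query value must be percent-encoded.
target = 'http://example.com/'
test_url = 'https://httpbin.org/redirect-to?' + urlencode({'url': target})
print(test_url)  # https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F
```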
@lopuhin The URL you gave me works well; below is the result:

```
http://httpbin.org/redirect-to?url=http://example.com/
2222222222222
2017-11-29 17:38:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/robots.txt> (referer: None)
2017-11-29 17:38:42 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://192.168.89.190:8050/robots.txt> (referer: None)
2017-11-29 17:38:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/redirect-to?url=http://example.com/ via http://192.168.89.190:8050/execute> (referer: None)
44444444444444
http://example.com/
3333333333
```

It's a strange result; can you help me explain it?
@3xp10it in this case I would first check that the redirect is handled correctly by splash using the splash UI (visit the splash URL in a browser and try loading the page you want crawled). If the redirect is handled differently by a browser and by splash, this is a splash problem. Do you know how the redirect is implemented? If it's done in JavaScript, maybe more wait time will help.
@lopuhin
The splash UI works well and returns the right URL:

```
url: "http://192.168.93.139/dvwa/vulnerabilities/xss_r/index.php"
```

The script I used in the splash UI is:

```lua
function main(splash, args)
    assert(splash:go{args.url, headers={
        ['Cookie']='security=impossible;PHPSESSID=q6ms9hf7sf8kingjhtespmpfu3;security=low',
    }})
    assert(splash:wait(0.5))
    return {
        url = splash:url(),
        html = splash:html(),
        png = splash:png(),
        har = splash:har(),
    }
end
```

That is to say, the redirect is handled differently by a browser and by splash?
@3xp10it it's great that this works in the splash UI; this means it's not a splash problem. But to be honest, now I'm not even sure where the problem can be. One more check that might help to debug this would be to print `response.data`; this should be the dict returned by the splash script. If the URL there is the redirected one, then the problem is in the scrapy-splash middleware or in how it is used. If the URL there is not what you want, then there could be some difference in the way splash is called between the splash UI and the spider.
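That decision logic can be written down as a tiny helper (the function name and return strings are hypothetical; in a spider, `data_url` would be `response.data['url']` and `request_url` the URL originally requested):

```python
def locate_problem(data_url, request_url):
    # If splash itself reported a different (redirected) URL but
    # response.url did not change, the loss happened on the scrapy side.
    if data_url != request_url:
        return 'scrapy-splash middleware (or its usage)'
    # If splash reported the original URL, the splash call from the
    # spider must differ from the one made in the splash UI.
    return 'how splash is called from the spider'

print(locate_problem('http://site/index.php', 'http://site/1.php'))
print(locate_problem('http://site/1.php', 'http://site/1.php'))
```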
@lopuhin
The URL in `response.data` is not the redirected URL I want; below is the result:

```
2222222222222
2017-11-29 18:08:22 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://192.168.93.139/robots.txt> (referer: None)
2017-11-29 18:08:22 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://192.168.89.190:8050/robots.txt> (referer: None)
2017-11-29 18:08:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://192.168.93.139/dvwa/vulnerabilities/xss_r/?name=?name=?name=?name=?name= via http://192.168.89.190:8050/execute> (referer: None)
44444444444444
http://192.168.93.139/dvwa/vulnerabilities/xss_r/?name=?name=?name=?name=?name=
5555555555555
{'url': 'http://192.168.93.139/dvwa/vulnerabilities/xss_r/?name=?name=?name=?name=?name=', 'html': '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head>\n\t\t<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n\n\t\t<title>Vulnerability: Reflected Cross Site Scripting (XSS) :: Damn Vulnerable Web A
```

Should I change my middleware setting? Below is my settings.py:
```python
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    #'crawler.middlewares.ProxyMiddleware': 843,
}
```
@3xp10it The middleware settings you provided look good. Since the URL in `response.data` is not what you want, the problem must not be in how the response is processed in scrapy-splash, but in how splash is called. Maybe you can try using exactly the same script that works in the splash UI for your spider?
@lopuhin I used exactly the same script that works in the splash UI for my spider, and it still doesn't work :( Below is the script:

```lua
function main(splash, args)
    assert(splash:go{splash.args.url,
                     http_method=splash.args.http_method,
                     body=splash.args.body,
                     headers={
        ['Cookie']='security=impossible;PHPSESSID=q6ms9hf7sf8kingjhtespmpfu3;security=low',
    }})
    assert(splash:wait(2))
    return { url = splash:url(), cookies = splash:get_cookies(), html = splash:html(), }
end
```
So, is there any solution to see the redirected URL (the new one) inside scrapy-splash?
I have the same problem and would be interested in a solution.
I'm not looking for this solution myself, but just an idea: if it's possible to fetch HAR data using scrapy-splash, it can be used to figure out all redirects. https://splash.readthedocs.io/en/stable/api.html#render-har
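A sketch of that idea, assuming the Lua script returns `{har = splash:har()}`: in the HAR 1.2 format each entry's response object carries a `redirectURL` field, so the redirect chain can be read straight off the entries. The sample HAR below is illustrative, not a real capture (and this only catches HTTP redirects, not JavaScript ones):

```python
def redirect_chain(har):
    # Collect (from_url, to_url) pairs for every 3xx response in the HAR.
    chain = []
    for entry in har.get('log', {}).get('entries', []):
        resp = entry.get('response', {})
        if 300 <= resp.get('status', 0) < 400:
            chain.append((entry['request']['url'], resp.get('redirectURL', '')))
    return chain

# Illustrative HAR data -- in a spider this would be response.data['har'].
sample_har = {'log': {'entries': [
    {'request': {'url': 'http://site/1.php'},
     'response': {'status': 302, 'redirectURL': 'http://site/index.php'}},
    {'request': {'url': 'http://site/index.php'},
     'response': {'status': 200}},
]}}
print(redirect_chain(sample_har))  # [('http://site/1.php', 'http://site/index.php')]
```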
I have the same problem and would be interested in a solution.
I'm sorry, I lost too much time trying to resolve this and switched to a better solution for me, pyppeteer. It doesn't generate a full HAR, but it's enough for my needs.
I recommend looking into it as well...
This does not seem specific to scrapy-splash, shall we move this to https://github.com/scrapinghub/splash?