
How to get redirect URLs with scrapy-splash

3xp10it opened this issue on Nov 29, 2017 · 15 comments

I don't know how to get the redirect URLs with scrapy-splash, can you help me? For example, http://xxx.xxx.xxx/1.php will redirect to http://xxx.xxx.xxx/index.php; how can I get http://xxx.xxx.xxx/index.php with scrapy-splash? Below is my code, which gets http://xxx.xxx.xxx/1.php instead of http://xxx.xxx.xxx/index.php.

    def parse_get(self, response):
        item = CrawlerItem()
        item['code'] = response.status
        item['current_url'] = response.url
        ############################# the line below prints http://xxx.xxx.xxx/1.php instead of the redirected URL
        print(response.url)


self.lua_script = """
        function main(splash, args)
          assert(splash:go{
              splash.args.url,
              http_method = splash.args.http_method,
              body = splash.args.body,
              headers = {['Cookie'] = '%s'},
          })
          assert(splash:wait(0.5))

          splash:on_request(function(request)
              request:set_proxy{
                  host = "%s",
                  port = %d
              }
          end)

          return {cookies = splash:get_cookies(), html = splash:html()}
        end
        """ % (self.cookie, a[0], a[1])

url = 'http://xxx.xxx.xxx/1.php'
yield SplashRequest(url, self.parse_get, endpoint='execute', magic_response=True, meta={'handle_httpstatus_all': True}, args={'lua_source': self.lua_script})


3xp10it avatar Nov 29 '17 08:11 3xp10it

@3xp10it splash handles redirects by itself, so the result you are getting is from the page it was redirected to. To get its URL, you can add url = splash:url() to the return values (see the example in the README below "Use a Lua script to get an HTML response with cookies, headers, body and method set to correct values") - after that, response.url should be the URL of the redirected page.
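On the spider side, a minimal sketch of what that gives you (assuming the Lua script returns url = splash:url(); the callback body here is just illustrative):

    def parse_get(self, response):
        # response.data is the raw dict returned by the Lua script;
        # with magic_response=True, response.url is taken from data['url'],
        # so both should point at the redirected page.
        print(response.url)
        print(response.data.get('url'))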

lopuhin avatar Nov 29 '17 09:11 lopuhin

@lopuhin

In my code, http://192.168.93.139/dvwa/vulnerabilities/xss_r/?name=?name=?name=?name=?name= redirects to http://192.168.93.139/dvwa/vulnerabilities/xss_r/index.php. I tried adding url = splash:url(), but it still fails:

self.lua_script = """
        function main(splash, args)
          assert(splash:go{
              splash.args.url,
              http_method = splash.args.http_method,
              body = splash.args.body,
              headers = {['Cookie'] = '%s'},
          })
          assert(splash:wait(0.5))

          splash:on_request(function(request)
              request:set_proxy{
                  host = "%s",
                  port = %d
              }
          end)

          return {url = splash:url(), cookies = splash:get_cookies(), html = splash:html()}
        end
        """ % (self.cookie, a[0], a[1])


    def parse_get(self, response):
        input(44444444444444)  # debug pause/marker
        item = CrawlerItem()
        item['code'] = response.status
        item['current_url'] = response.url
        print(response.url)
        input(3333333333)  # debug pause/marker
        if response.url == "http://192.168.93.139/dvwa/vulnerabilities/xss_r/?name=?name=?name=?name=?name=":
            print('fail ....................')
        if response.url == "http://192.168.93.139/dvwa/vulnerabilities/xss_r/index.php":
            print('succeed .................')

Below is the result:

2222222222222
2017-11-29 17:26:26 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://192.168.93.139/robots.txt> (referer: None)
2017-11-29 17:26:26 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://192.168.89.190:8050/robots.txt> (referer: None)
2017-11-29 17:26:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://192.168.93.139/dvwa/vulnerabilities/xss_r/?name=?name=?name=?name=?name= via http://192.168.89.190:8050/execute> (referer: None)
44444444444444
http://192.168.93.139/dvwa/vulnerabilities/xss_r/?name=?name=?name=?name=?name=
3333333333
fail ....................

3xp10it avatar Nov 29 '17 09:11 3xp10it

@3xp10it I see, that's not what I expected... Just to be sure: you are not turning off magic_response anywhere, and scrapy_splash.SplashMiddleware is used, right? Also, maybe you could try crawling https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F to check whether it works for another domain?
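For a quick isolation test, something like this stripped-down spider could be used (a sketch, not your exact setup: the spider name, callback, and Lua script are placeholders, and SPLASH_URL plus the scrapy-splash middlewares are assumed to be configured in settings.py):

    import scrapy
    from scrapy_splash import SplashRequest

    # Minimal Lua script for the test: just load the page and report the final URL.
    LUA = """
    function main(splash, args)
      assert(splash:go(args.url))
      assert(splash:wait(0.5))
      return {url = splash:url(), html = splash:html()}
    end
    """

    class RedirectCheckSpider(scrapy.Spider):
        name = 'redirect_check'

        def start_requests(self):
            url = 'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F'
            yield SplashRequest(url, self.parse_result, endpoint='execute',
                                magic_response=True, args={'lua_source': LUA})

        def parse_result(self, response):
            # With magic_response, response.url should already be the final URL;
            # response.data is the raw dict returned by the Lua script.
            self.logger.info('response.url: %s', response.url)
            self.logger.info('splash:url(): %s', response.data.get('url'))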

lopuhin avatar Nov 29 '17 09:11 lopuhin

@lopuhin The URL you gave me works well, below is the result:

http://httpbin.org/redirect-to?url=http://example.com/                                                                                             
2222222222222                                                                                                                                                         
2017-11-29 17:38:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/robots.txt> (referer: None)                                     
2017-11-29 17:38:42 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://192.168.89.190:8050/robots.txt> (referer: None)                                                                                       
2017-11-29 17:38:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/redirect-to?url=http://example.com/ via http://192.168.89.190:8050/execute> (referer: None)
44444444444444                                                                      
http://example.com/                                                                                                                                                         
3333333333

It's a strange result; can you help me explain it?

3xp10it avatar Nov 29 '17 09:11 3xp10it

@3xp10it in this case I would first check that the redirect is handled correctly by splash using the splash UI (visit the splash URL in a browser and try loading the page you want to crawl). If the redirect is handled differently by a browser and by splash, this is a splash problem. Do you know how the redirect is implemented? If it's done in javascript, maybe more wait time will help.

lopuhin avatar Nov 29 '17 09:11 lopuhin

@lopuhin

The splash UI works well and returns the right URL:

url: "http://192.168.93.139/dvwa/vulnerabilities/xss_r/index.php"

my script used in splash UI is:

function main(splash, args)
  assert(splash:go{
      args.url,
      headers = {['Cookie'] = 'security=impossible;PHPSESSID=q6ms9hf7sf8kingjhtespmpfu3;security=low'},
  })
  assert(splash:wait(0.5))
  return {
    url = splash:url(),
    html = splash:html(),
    png = splash:png(),
    har = splash:har(),
  }
end

That is to say, the redirect is handled differently by a browser and by splash?

3xp10it avatar Nov 29 '17 09:11 3xp10it

@3xp10it it's great that this works in the splash UI - this means it's not a splash problem. But to be honest, now I'm not even sure where the problem can be. One more check that might help to debug this would be to print response.data - this should be the dict returned by the splash script. If the URL is redirected there, then the problem is in the scrapy-splash middleware or in how it is used. If the URL there is not what you want, then there could be some difference in the way splash is called between the splash UI and the spider.
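For instance, a throwaway line in the callback (just a sketch) would show what the script actually returned:

    def parse_get(self, response):
        # response.data is the JSON dict produced by the Lua script's return statement
        print(response.data)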

lopuhin avatar Nov 29 '17 10:11 lopuhin

@lopuhin The URL in response.data is not the redirected URL I want, below is the result:

2222222222222                                               
2017-11-29 18:08:22 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://192.168.93.139/robots.txt> (referer: None)
2017-11-29 18:08:22 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://192.168.89.190:8050/robots.txt> (referer: None)
2017-11-29 18:08:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://192.168.93.139/dvwa/vulnerabilities/xss_r/?name=?name=?name=?name=?name= via http://192.168.89.190:8050/execute> (referer: None)
44444444444444                                        
http://192.168.93.139/dvwa/vulnerabilities/xss_r/?name=?name=?name=?name=?name=
5555555555555                                               
{'url': 'http://192.168.93.139/dvwa/vulnerabilities/xss_r/?name=?name=?name=?name=?name=', 'html': '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head>\n\t\t<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n\n\t\t<title>Vulnerability: Reflected Cross Site Scripting (XSS) :: Damn Vulnerable Web A

Should I change my middleware settings? Below is my settings.py:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    #'crawler.middlewares.ProxyMiddleware': 843,
}

3xp10it avatar Nov 29 '17 10:11 3xp10it

@3xp10it The middleware settings you provided look good. Since the URL in response.data is not what you want, the problem must not be in how the response is processed by scrapy-splash, but in how splash is called. Maybe you can try to use exactly the same script that works in the splash UI for your spider?

lopuhin avatar Nov 29 '17 10:11 lopuhin

@lopuhin I used exactly the same script that works in the splash UI for my spider, and it doesn't work :( Below is the script:

        function main(splash, args)
          assert(splash:go{
              splash.args.url,
              http_method = splash.args.http_method,
              body = splash.args.body,
              headers = {['Cookie'] = 'security=impossible;PHPSESSID=q6ms9hf7sf8kingjhtespmpfu3;security=low'},
          })
          assert(splash:wait(2))

          return {url = splash:url(), cookies = splash:get_cookies(), html = splash:html()}
        end

3xp10it avatar Nov 30 '17 01:11 3xp10it

So, is there any solution to see the redirected URL (the new one) inside scrapy-splash?

civanescu avatar Jan 22 '19 14:01 civanescu

I have the same problem and would be interested in a solution.

mutterkorn avatar Feb 05 '19 10:02 mutterkorn

I'm not looking for a solution to this myself, but just an idea: if it's possible to fetch HAR data using scrapy-splash, it can be used to figure out all the redirects, as sketched below. https://splash.readthedocs.io/en/stable/api.html#render-har
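A rough sketch of that idea (assuming the Lua script also returns har = splash:har(), as in the splash UI example above; the callback name is illustrative, and the dict layout follows the standard HAR format):

    def parse_result(self, response):
        # Walk the HAR entries returned by splash:har() and collect the
        # redirect chain: 3xx responses carry the target in 'redirectURL'.
        har = response.data.get('har', {})
        entries = har.get('log', {}).get('entries', [])
        for entry in entries:
            status = entry['response']['status']
            redirect_to = entry['response'].get('redirectURL', '')
            if 300 <= status < 400 and redirect_to:
                print('%s -> %s (%d)' % (entry['request']['url'], redirect_to, status))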

kai11 avatar Feb 08 '19 08:02 kai11

I have the same problem and would be interested in a solution.

I'm sorry, I lost too much time trying to resolve this and switched to a better solution - pyppeteer. It doesn't generate a full HAR, but for my needs it's enough.

I recommend looking into it as well...

civanescu avatar Feb 14 '19 09:02 civanescu

This does not seem specific to scrapy-splash, shall we move this to https://github.com/scrapinghub/splash?

Gallaecio avatar Nov 26 '19 11:11 Gallaecio