scrapy-splash Bad request to Splash & HTTP status code is not handled or not allowed

Open linukey opened this issue 6 years ago • 4 comments

Hi kmike, I use scrapy-splash and have run into an issue. The first time I run 'scrapy crawl toutiao', it works correctly, but the second time I run it, an error occurs.

I found that the issue is related to the headers I add: without the headers, every run works, but with the headers, the second run fails.

The Lua script and spider code follow. I need your help, thanks.

code:

import scrapy
import json
from scrapy_splash import SplashRequest
from scrapy.http.headers import Headers

script = """ 
function main(splash)
  splash:init_cookies(splash.args.cookies)
  assert(splash:go{
                    splash.args.url,
                    headers=splash.args.headers,
                    http_method=splash.args.http_method,
                    body=splash.args.body,
                  })

  assert(splash:wait(0.5))

  local entries = splash:history()
  local last_response = entries[#entries].response

  return {
    headers = last_response.headers,
    cookies = splash:get_cookies(),
    html = splash:html(),
    url = splash:url(),
    http_status = last_response.status,
  }
end
"""

HEADERS = Headers({
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'compress',
    'Accept-Language': 'en-US',
    'Connection': 'keep-alive',
    'Cache-Control': 'no-cache',
    'Pragma': 'no-cache',
    'Host':'m.toutiao.com',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent':'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Mobile Safari/537.36'
})

class MySpider(scrapy.Spider):
    name = "toutiao"

    def __init__(self, *args, **kwargs):
        # Let scrapy.Spider finish its own initialization first.
        super().__init__(*args, **kwargs)
        self.start_url = "https://m.toutiao.com"

    def start_requests(self):
        yield SplashRequest(url=self.start_url,
                            callback=self.parse_result,
                            endpoint='execute',
                            cache_args=['lua_source'],
                            args={'lua_source': script, 'http_method': 'GET'},
                            headers=HEADERS)

    def parse_result(self, response):
        print("ok")
        print(response.headers)

The first run succeeds:

ok
{b'Vary': [b'Accept-Encoding, Accept-Encoding, Accept-Encoding'], b'Timing-Allow-Origin': [b'*'], b'Set-Cookie': [b'tt_webid=653006869922952004; Max-Age=7776000'], b'Transfer-Encoding': [b'chunked'], b'Content-Type': [b'text/html; charset=utf-8'], b'Connection': [b'keep-alive'], b'X-Tt-Timestamp': [b'152040098.652'], b'X-Ss-Set-Cookie': [b'tt_webid=653006899221952004; Max-Age=7776000'], b'Server': [b'Tengine'], b'Via': [b'cache1.cn406[13,0]'], b'Content-Encoding': [b'gzip'], b'Eagleid': [b'dcb54e411524000986256455e'], b'Date': [b'Wed, 07 Mar 2018 05:21:38 GMT']}

The second run fails:

2018-03-07 13:18:54 [scrapy.core.engine] INFO: Spider opened
2018-03-07 13:18:54 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-03-07 13:18:54 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-03-07 13:18:55 [scrapy_splash.middleware] WARNING: Bad request to Splash: {'info': {'message': 'Lua error: [string "..."]:14: attempt to index field \'?\' (a nil value)', 'type': 'LUA_ERROR', 'source': '[string "..."]', 'error': "attempt to index field '?' (a nil value)", 'line_number': 14}, 'description': 'Error happened while executing Lua script', 'error': 400, 'type': 'ScriptError'}
2018-03-07 13:18:55 [scrapy.core.engine] DEBUG: Crawled (400) <GET https://m.toutiao.com via http://172.17.0.2:8050/execute> (referer: None)
2018-03-07 13:18:55 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 https://m.toutiao.com>: HTTP status code is not handled or not allowed
2018-03-07 13:18:55 [scrapy.core.engine] INFO: Closing spider (finished)

Mar 07 '18 05:03 linukey

The 'Bad request to Splash' error may be caused by 'local last_response = entries[#entries].response', but I don't know how to fix it.
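One way to confirm which statement the reported 'line_number': 14 refers to is to index into the Lua source by line. The sketch below repeats the script from the spider and uses a small helper, lua_line_at, which is a hypothetical name introduced here for illustration and not part of scrapy-splash. It assumes the first line after the opening triple quote counts as line 1 of the Lua chunk.

```python
# Same Lua source as in the spider above. The text right after the opening
# triple quote is an empty first line, so "function main" is line 2.
script = """
function main(splash)
  splash:init_cookies(splash.args.cookies)
  assert(splash:go{
                    splash.args.url,
                    headers=splash.args.headers,
                    http_method=splash.args.http_method,
                    body=splash.args.body,
                  })

  assert(splash:wait(0.5))

  local entries = splash:history()
  local last_response = entries[#entries].response

  return {
    headers = last_response.headers,
    cookies = splash:get_cookies(),
    html = splash:html(),
    url = splash:url(),
    http_status = last_response.status,
  }
end
"""


def lua_line_at(source: str, line_number: int) -> str:
    """Return the 1-indexed line of a Lua source string, stripped of indentation."""
    return source.splitlines()[line_number - 1].strip()


# 14 is the 'line_number' value from the Splash error payload in the log.
print(lua_line_at(script, 14))
```

Under that assumption, the lookup lands on the entries[#entries] statement, which fits the "attempt to index field '?' (a nil value)" message: when entries is empty, entries[#entries] is nil.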

Mar 07 '18 05:03 linukey

I have a similar issue. For some requests I make, splash:history() returns an empty array, so indexing into entries[#entries] throws an error. What could cause Splash not to populate the history? And how can I get the resulting headers and HTTP status in this case?

Mar 15 '18 01:03 nirvana-msu

Yeah, that can be the problem. It is caused by the cache: when a response is fetched from the in-memory cache, it doesn't get a record in splash:history(). I don't have a good workaround right now; it makes sense to check that the history is not empty before taking the last entry.
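A minimal sketch of that check, written as a drop-in replacement for the script string in the spider: the Lua code only reads the last history entry when the history is non-empty, and otherwise returns the result without headers and http_status. This is an illustration of the suggestion, not a confirmed fix; how scrapy-splash fills in the missing fields on the Scrapy side is not covered here, and the original headers and status of a cached response are still unavailable.

```python
# Lua script for the Splash 'execute' endpoint, guarded against an empty
# splash:history(). Pass it as args={'lua_source': script} as before.
script = """
function main(splash)
  splash:init_cookies(splash.args.cookies)
  assert(splash:go{
    splash.args.url,
    headers=splash.args.headers,
    http_method=splash.args.http_method,
    body=splash.args.body,
  })

  assert(splash:wait(0.5))

  -- Fields that do not depend on the request history.
  local result = {
    cookies = splash:get_cookies(),
    html = splash:html(),
    url = splash:url(),
  }

  -- The history can be empty, e.g. when the response was served from
  -- the in-memory cache, so only index into it when it has entries.
  local entries = splash:history()
  if #entries > 0 then
    local last_response = entries[#entries].response
    result.headers = last_response.headers
    result.http_status = last_response.status
  end

  return result
end
"""
```

With this version the Lua error on the second run should no longer be raised, but the callback has to cope with a response whose headers and status were not taken from the target site.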

Mar 15 '18 14:03 kmike

@kmike I am fine with disabling the cache (in fact, I would prefer to do that). It seems that this is not possible until https://github.com/scrapinghub/splash/pull/339 is merged? Related issues: https://github.com/scrapinghub/splash/issues/203, https://github.com/scrapinghub/splash/issues/519.

Mar 17 '18 14:03 nirvana-msu
