scrapy-splash
Bad request to Splash & HTTP status code is not handled or not allowed
Hi kmike, I'm using scrapy-splash and have run into an issue. The first time I run 'scrapy crawl toutiao' it works correctly, but the second run fails.
The problem seems to be related to the headers I add: without the headers every run works, but with them the second run fails.
The Lua script and spider are below. I need your help, thanks.
code:
import scrapy
import json
from scrapy_splash import SplashRequest
from scrapy.http.headers import Headers

script = """
function main(splash)
  splash:init_cookies(splash.args.cookies)
  assert(splash:go{
    splash.args.url,
    headers=splash.args.headers,
    http_method=splash.args.http_method,
    body=splash.args.body,
  })
  assert(splash:wait(0.5))
  local entries = splash:history()
  local last_response = entries[#entries].response
  return {
    headers = last_response.headers,
    cookies = splash:get_cookies(),
    html = splash:html(),
    url = splash:url(),
    http_status = last_response.status,
  }
end
"""

HEADERS = Headers({
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'compress',
    'Accept-Language': 'en-US',
    'Connection': 'keep-alive',
    'Cache-Control': 'no-cache',
    'Pragma': 'no-cache',
    'Host': 'm.toutiao.com',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Mobile Safari/537.36'
})


class MySpider(scrapy.Spider):
    name = "toutiao"

    def __init__(self):
        self.start_url = "https://m.toutiao.com"

    def start_requests(self):
        yield SplashRequest(url=self.start_url,
                            callback=self.parse_result,
                            endpoint='execute',
                            cache_args=['lua_source'],
                            args={'lua_source': script, 'http_method': 'GET'},
                            headers=HEADERS)

    def parse_result(self, response):
        print("ok")
        print(response.headers)
The first run works correctly:
ok
{b'Vary': [b'Accept-Encoding, Accept-Encoding, Accept-Encoding'], b'Timing-Allow-Origin': [b'*'], b'Set-Cookie': [b'tt_webid=653006869922952004; Max-Age=7776000'], b'Transfer-Encoding': [b'chunked'], b'Content-Type': [b'text/html; charset=utf-8'], b'Connection': [b'keep-alive'], b'X-Tt-Timestamp': [b'152040098.652'], b'X-Ss-Set-Cookie': [b'tt_webid=653006899221952004; Max-Age=7776000'], b'Server': [b'Tengine'], b'Via': [b'cache1.cn406[13,0]'], b'Content-Encoding': [b'gzip'], b'Eagleid': [b'dcb54e411524000986256455e'], b'Date': [b'Wed, 07 Mar 2018 05:21:38 GMT']}
The second run fails:
2018-03-07 13:18:54 [scrapy.core.engine] INFO: Spider opened
2018-03-07 13:18:54 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-03-07 13:18:54 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-03-07 13:18:55 [scrapy_splash.middleware] WARNING: Bad request to Splash: {'info': {'message': 'Lua error: [string "..."]:14: attempt to index field \'?\' (a nil value)', 'type': 'LUA_ERROR', 'source': '[string "..."]', 'error': "attempt to index field '?' (a nil value)", 'line_number': 14}, 'description': 'Error happened while executing Lua script', 'error': 400, 'type': 'ScriptError'}
2018-03-07 13:18:55 [scrapy.core.engine] DEBUG: Crawled (400) <GET https://m.toutiao.com via http://172.17.0.2:8050/execute> (referer: None)
2018-03-07 13:18:55 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 https://m.toutiao.com>: HTTP status code is not handled or not allowed
2018-03-07 13:18:55 [scrapy.core.engine] INFO: Closing spider (finished)
The 'Bad request to Splash' error may be caused by 'local last_response = entries[#entries].response', but I don't know how to fix it.
I have a similar issue. For some requests I make, splash:history() returns an empty array, so entries[#entries] is nil and accessing .response on it throws an error. What could cause Splash to not populate the history? And how can I get the resulting headers and HTTP status in this case?
Yeah, that can be the problem. It is caused by caching: when a response is fetched from the in-memory cache, it doesn't get a record in splash:history(). I don't have a good workaround right now; it makes sense to check that the history is not empty before taking the last entry.
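For anyone landing here, the "check that the history is not empty" suggestion could look roughly like the sketch below. It is only an illustration, not an official fix: the names SAFE_SCRIPT and has_history_metadata are made up for this example, and it assumes that when the script omits the headers and http_status keys, scrapy-splash simply leaves the response's default headers and status in place.

```python
# Hypothetical variant of the script from the original post. The only change
# is the guard around entries[#entries], so an empty splash:history() no
# longer raises "attempt to index field '?' (a nil value)".
SAFE_SCRIPT = """
function main(splash)
  splash:init_cookies(splash.args.cookies)
  assert(splash:go{
    splash.args.url,
    headers=splash.args.headers,
    http_method=splash.args.http_method,
    body=splash.args.body,
  })
  assert(splash:wait(0.5))

  local result = {
    cookies = splash:get_cookies(),
    html = splash:html(),
    url = splash:url(),
  }

  -- The history can be empty when the page came from the in-memory cache,
  -- so only read the last entry if there is one.
  local entries = splash:history()
  if #entries > 0 then
    local last_response = entries[#entries].response
    result.headers = last_response.headers
    result.http_status = last_response.status
  end
  return result
end
"""


def has_history_metadata(data):
    """Return True if the script result carries headers and status from history.

    `data` is the dict returned by the Lua script (hypothetical helper; in a
    spider callback this would typically be `response.data`).
    """
    return "headers" in data and "http_status" in data
```

In the spider, SAFE_SCRIPT would be passed as args={'lua_source': SAFE_SCRIPT, ...} in place of script, and the callback can use has_history_metadata(response.data) to tell whether the headers and status came from the page itself. This does not recover the real headers for cached responses; it only avoids the 400 ScriptError.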
@kmike I am fine with disabling the cache (in fact, I would prefer to do that). It seems like that's not possible until https://github.com/scrapinghub/splash/pull/339 is merged? Related issues: https://github.com/scrapinghub/splash/issues/203, https://github.com/scrapinghub/splash/issues/519.