undercrawler
Lua page script timeouts when trying to render binary pages
When the page is not an HTML page but binary content (we cannot know this for sure when extracting links), the Lua script times out (even without HH enabled). Not only do we fail to download such pages, but this also slows down the whole crawl a lot.
What happens exactly? Is it because the timeout is not large enough to download the file, or is it a problem because Splash doesn't handle non-HTML content in `splash:go`?
It's the latter - even a plain

```lua
function main(splash)
  local url = splash.args.url
  assert(splash:go{url})
end
```

~~fails~~ times out.
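As a possible mitigation in the meantime, a sketch that bounds how long a single request may hang and handles the `splash:go` error instead of asserting (this assumes a Splash version that supports `splash.resource_timeout`; the exact error strings returned in `reason` may vary):

```lua
function main(splash)
  -- Abort any single request that takes longer than 10s,
  -- so a hanging binary download can't eat the whole render budget.
  splash.resource_timeout = 10

  local ok, reason = splash:go(splash.args.url)
  if not ok then
    -- Report the failure instead of letting assert() kill the script.
    return {error = reason}
  end
  return {html = splash:html()}
end
```

This doesn't fix the underlying unsupported-content problem, but it should keep one bad URL from stalling the crawl for the full render timeout.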
But it looks like not just ANY binary content causes `splash:go` to fail; I'll try to narrow it down.
Splash doesn't currently handle unsupported content (http://doc.qt.io/archives/qt-5.5/qwebpage.html#forwardUnsupportedContent-prop); to fix this we need to add an API for it to Splash.
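Until such an API exists, another possible workaround is to check the `Content-Type` before rendering. A sketch, assuming a Splash version that provides `splash:http_get` and that the response object exposes a `headers` table (note this issues an extra request for the resource):

```lua
function main(splash)
  local url = splash.args.url

  -- Fetch the resource without rendering it, just to inspect its headers.
  local reply = splash:http_get(url)
  local ctype = (reply.headers and reply.headers["Content-Type"]) or ""

  if not string.find(ctype, "text/html", 1, true) then
    -- Binary or otherwise non-HTML content: skip rendering entirely.
    return {skipped = true, content_type = ctype}
  end

  assert(splash:go(url))
  return {html = splash:html()}
end
```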
The link was extracted from an `<a href="URL" id="ctl00_MasterMain_Hot_rpHot_ctl10_navImg" onclick="return hs.expand(this, {captionId: \'caption1\'})">` element, so the extraction itself is correct (I don't think we should drop links with `onclick` if we don't click on them).