
Lua page script timeouts when trying to render binary pages

Open lopuhin opened this issue 9 years ago • 5 comments

When the page is not an HTML page but binary content (we cannot know for sure when extracting links), the Lua script times out (even without HH enabled). Not only do we fail to download such pages, but this also slows down the whole crawl a lot.

lopuhin avatar Apr 29 '16 14:04 lopuhin

What happens? Is it because the timeout is not large enough to download a file, or is it a problem because Splash doesn't handle non-HTML content in splash:go?

kmike avatar Apr 29 '16 14:04 kmike

It's the latter; even a plain

```lua
function main(splash)
  local url = splash.args.url
  assert(splash:go{url})
end
```

~~fails~~ times out.

lopuhin avatar Apr 29 '16 14:04 lopuhin
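A possible script-side workaround is sketched below; it assumes `splash:http_get` is available in the Splash version in use and that the server reports an accurate `Content-Type` header, so treat it as an illustration rather than a fix for the underlying problem:

```lua
-- Sketch of a workaround: check the Content-Type before rendering and
-- skip non-HTML responses instead of letting splash:go hang.
-- Note: splash:http_get still downloads the body, so this only avoids
-- the rendering step, not the download itself.
function main(splash)
  local url = splash.args.url
  local reply = splash:http_get{url}
  local ctype = reply.headers and reply.headers["Content-Type"] or ""
  if not string.find(ctype, "text/html", 1, true) then
    -- Binary or otherwise unsupported content: report it and bail out early.
    return {skipped = true, content_type = ctype}
  end
  assert(splash:go{url})
  return {html = splash:html()}
end
```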

But it looks like it's not ANY binary content that causes splash:go to fail; I'll try to narrow it down.

lopuhin avatar Apr 29 '16 15:04 lopuhin

Splash doesn't handle unsupported content right now (http://doc.qt.io/archives/qt-5.5/qwebpage.html#forwardUnsupportedContent-prop); to fix this we need to add an API for it to Splash.

kmike avatar Apr 29 '16 15:04 kmike
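For illustration only, such an API might look something like the sketch below from the script side. This is purely hypothetical: `on_unsupported_content` does not exist in Splash; it just mirrors QWebPage's `forwardUnsupportedContent` property and `unsupportedContent` signal that the linked Qt docs describe.

```lua
-- Purely hypothetical sketch of a Splash API for unsupported content;
-- neither on_unsupported_content nor this behaviour exists in Splash.
-- The idea is that splash:go would return instead of timing out.
function main(splash)
  local skipped_ctype = nil
  splash:on_unsupported_content(function(reply)  -- hypothetical callback
    skipped_ctype = reply.headers["Content-Type"]
  end)
  local ok, reason = splash:go(splash.args.url)
  if skipped_ctype then
    return {skipped = true, content_type = skipped_ctype}
  end
  assert(ok, reason)
  return {html = splash:html()}
end
```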

The link was extracted from an `<a href="URL" id="ctl00_MasterMain_Hot_rpHot_ctl10_navImg" onclick="return hs.expand(this, {captionId: \'caption1\'})">` element, so the extraction is correct (I don't think we should drop links with onclick handlers if we don't click on them).

lopuhin avatar Apr 29 '16 15:04 lopuhin