
When crawling, all domains appear to be DOWN

Open sunil3590 opened this issue 5 years ago • 5 comments

ISSUE I tried to crawl a regular domain (not .onion) and the status of the domain comes up as DOWN. I've tried this with multiple domains and even .onion domains, but the result is the same: all domains are DOWN.

SETUP I have AIL, Tor, and Splash all installed and running on a single machine, with one Docker instance of Splash listening on 8050 and Tor listening on 9050.

tcp        0      0 127.0.0.1:9050          0.0.0.0:*               LISTEN      18298/tor           
tcp6       0      0 :::8050                 :::*                    LISTEN      22611/docker-proxy 
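One thing worth checking with this setup (a suggestion, not a confirmed cause): the netstat output shows Tor bound to 127.0.0.1 only, but Splash runs inside a Docker container, where 127.0.0.1 refers to the container itself. If the crawler tells Splash to proxy through 127.0.0.1:9050, the connection fails inside the container and every domain looks DOWN. A quick reachability check, using a hypothetical `tcp_reachable` helper:

```python
import socket

def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# From the host, Tor should answer on loopback:
print(tcp_reachable("127.0.0.1", 9050))
# From inside the Splash container, Tor must instead be reached via the
# Docker bridge IP (commonly 172.17.0.1), which also requires Tor to
# listen on that interface (e.g. "SOCKSPort 0.0.0.0:9050" in torrc).
```

Running the same check from inside the container (`docker exec`) against both addresses quickly tells you which side of the bridge is broken.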

Logs from Splash Docker

2020-04-10 08:56:20.300419 [-] "X.X.X.X" - - [10/Apr/2020:08:56:19 +0000] "GET / HTTP/1.1" 200 7679 "-" "python-requests/2.22.0"
2020-04-10 08:56:20.859058 [render] [140342956635136] loadFinished: unknown error
2020-04-10 08:56:20.860248 [events] {"path": "/execute", "rendertime": 0.007615327835083008, "maxrss": 176844, "load": [0.05, 0.19, 0.18], "fds": 60, "active": 0, "qsize": 0, "_id": 140342956635136, "method": "POST", "timestamp": 1586508980, "user-agent": "Mozilla/5.0 (Windows NT 6.1; rv:60.0) Gecko/20100101 Firefox/60.0", "args": {"cookies": [], "headers": {"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "Accept-Language": "en", "User-Agent": "Mozilla/5.0 (Windows NT 6.1; rv:60.0) Gecko/20100101 Firefox/60.0"}, "lua_source": "\nfunction main(splash, args)\n    -- Default values\n    splash.js_enabled = true\n    splash.private_mode_enabled = true\n    splash.images_enabled = true\n    splash.webgl_enabled = true\n    splash.media_source_enabled = true\n\n    -- Force enable things\n    splash.plugins_enabled = true\n    splash.request_body_enabled = true\n    splash.response_body_enabled = true\n\n    splash.indexeddb_enabled = true\n    splash.html5_media_enabled = true\n    splash.http2_enabled = true\n\n    -- User defined\n    splash.resource_timeout = args.resource_timeout\n    splash.timeout = args.timeout\n\n    -- Allow to pass cookies\n    splash:init_cookies(args.cookies)\n\n    -- Run\n    ok, reason = splash:go{args.url}\n    if not ok and not reason:find(\"http\") then\n        return {\n            error = reason,\n            last_url = splash:url()\n        }\n    end\n    if reason == \"http504\" then\n        splash:set_result_status_code(504)\n        return ''\n    end\n\n    splash:wait{args.wait}\n    -- Page instrumentation\n    -- splash.scroll_position = {y=1000}\n    splash:wait{args.wait}\n    -- Response\n    return {\n        har = splash:har(),\n        html = splash:html(),\n        png = splash:png{render_all=true},\n        cookies = splash:get_cookies(),\n        last_url = splash:url()\n    }\nend\n", "resource_timeout": 30, "timeout": 30, "url": "http://somedomain.onion", "wait": 10, "uid": 
140342956635136}, "status_code": 200, "client_ip": "172.17.0.1"}
2020-04-10 08:56:20.860431 [-] "172.17.0.1" - - [10/Apr/2020:08:56:19 +0000] "POST /execute HTTP/1.1" 200 68 "-" "Mozilla/5.0 (Windows NT 6.1; rv:60.0) Gecko/20100101 Firefox/60.0"
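For context, the crawler drives Splash through its HTTP `/execute` endpoint, and the `[events]` log entry above is Splash echoing back the request arguments it received. A minimal sketch of how such a request body is assembled (the `build_splash_payload` helper and its defaults are illustrative, not AIL's actual code):

```python
import json

def build_splash_payload(url, lua_source, timeout=30, resource_timeout=30, wait=10):
    """Assemble the JSON body for a POST to Splash's /execute endpoint."""
    return {
        "url": url,
        "lua_source": lua_source,   # the Lua script shown in the log above
        "timeout": timeout,
        "resource_timeout": resource_timeout,
        "wait": wait,
        "cookies": [],
    }

payload = build_splash_payload(
    "http://somedomain.onion",
    "function main(splash, args) ... end",
)
body = json.dumps(payload)
# This body would be POSTed to http://127.0.0.1:8050/execute; Splash
# replies with whatever the Lua script returns -- here the
# {error = reason, last_url = ...} table when splash:go fails, which is
# why the POST still logs a 200 even though the page load failed.
```

That last point matters for debugging: a 200 on `/execute` does not mean the target loaded, only that the Lua script ran to completion.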

This is the line of code in Splash that generates the error message above: https://github.com/scrapinghub/splash/blob/9fda128b8485dd5f67eb103cd30df8f325a90bb0/splash/engines/webkit/browser_tab.py#L446

sunil3590 avatar Apr 10 '20 09:04 sunil3590

Were you able to fix this? @sunil3590 Experiencing the same issue, Splash Down and all domains are down.

GaganBhat avatar Sep 28 '20 02:09 GaganBhat

@Terrtia I'm having a similar issue with Tor links, where I get a "SPLASH DOWN" error, but only with onion links. [screenshot]

The regular crawler, however, works. [screenshot]

GaganBhat avatar Oct 02 '20 02:10 GaganBhat

Hello, I have the same issue. Is there any update? Thanks.

TheFausap avatar Feb 08 '21 14:02 TheFausap

I may have found the error in the screen logs (`screen -r Crawlers_AIL`):

 File "/opt/AIL/bin/torcrawler/TorSplashCrawler.py", line 181, in parse
    error_retry = request.meta.get('error_retry', 0)
NameError: name 'request' is not defined
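That NameError makes sense: in a Scrapy `parse` callback, only `response` is in scope, so a bare `request` is undefined. The request that produced the response is reachable as `response.request`, and Scrapy also exposes `response.meta` as a shortcut for `response.request.meta`. A minimal sketch of the corrected pattern, using stand-in classes since the real ones come from Scrapy:

```python
# Hypothetical stand-ins for scrapy.Request / scrapy.http.Response,
# only to show where `meta` lives; not AIL's or Scrapy's actual code.
class Request:
    def __init__(self, url, meta=None):
        self.url = url
        self.meta = meta or {}

class Response:
    def __init__(self, request):
        self.request = request

    @property
    def meta(self):
        # Scrapy provides this same shortcut to request.meta
        return self.request.meta

def parse(response):
    # The crashing line referenced a bare `request`; the fix is to go
    # through the response object instead:
    error_retry = response.request.meta.get('error_retry', 0)  # or response.meta
    return error_retry

req = Request("http://somedomain.onion", meta={"error_retry": 2})
print(parse(Response(req)))  # prints 2
```

With `meta.get('error_retry', 0)`, a request that has never failed starts its retry counter at 0, which is presumably what line 181 of TorSplashCrawler.py intended.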

TheFausap avatar Feb 08 '21 14:02 TheFausap

@TheFausap @Terrtia did you find a fix for this? I also can't crawl any onion domain; they all appear to be DOWN.

matriceria avatar Feb 12 '22 21:02 matriceria