
QNetworkReplyImplPrivate::error: Internal problem, this method must only be called once.

JWBWork opened this issue on Jun 25, 2024

This specific website is throwing an exception I can't make sense of:

QNetworkReplyImplPrivate::error: Internal problem, this method must only be called once.

It results in the Splash Docker container hanging: it becomes unresponsive to all further requests. Running with more verbose logging didn't reveal any additional information.

The logs:

(.venv) C:\Users\me\path\to\project>docker run -p 8050:8050 scrapinghub/splash:latest                                                                          
2024-06-25 20:39:41+0000 [-] Log opened.
2024-06-25 20:39:41.947216 [-] Xvfb is started: ['Xvfb', ':769163157', '-screen', '0', '1024x768x24', '-nolisten', 'tcp']
QStandardPaths: XDG_RUNTIME_DIR not set, defaulting to '/tmp/runtime-splash'
2024-06-25 20:39:42.012362 [-] Splash version: 3.5
2024-06-25 20:39:42.045852 [-] Qt 5.14.1, PyQt 5.14.2, WebKit 602.1, Chromium 77.0.3865.129, sip 4.19.22, Twisted 19.7.0, Lua 5.2
2024-06-25 20:39:42.046036 [-] Python 3.6.9 (default, Jul 17 2020, 12:50:27) [GCC 8.4.0]
2024-06-25 20:39:42.046099 [-] Open files limit: 1048576
2024-06-25 20:39:42.046140 [-] Can't bump open files limit
2024-06-25 20:39:42.061355 [-] proxy profiles support is enabled, proxy profiles path: /etc/splash/proxy-profiles
2024-06-25 20:39:42.061513 [-] memory cache: enabled, private mode: enabled, js cross-domain access: disabled
2024-06-25 20:39:42.170427 [-] verbosity=1, slots=20, argument_cache_max_entries=500, max-timeout=90.0
2024-06-25 20:39:42.170695 [-] Web UI: enabled, Lua: enabled (sandbox: enabled), Webkit: enabled, Chromium: enabled
2024-06-25 20:39:42.171427 [-] Site starting on 8050
2024-06-25 20:39:42.171615 [-] Starting factory <twisted.web.server.Site object at 0x7f96c40ae5c0>
2024-06-25 20:39:42.172103 [-] Server listening on http://0.0.0.0:8050
QNetworkReplyImplPrivate::error: Internal problem, this method must only be called once.
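
For what it's worth, hitting Splash's render.html HTTP endpoint directly (no Scrapy involved) is one way to check whether the hang is in Splash itself. A minimal sketch using the documented render.html parameters (url, timeout, wait); the timeout values here are arbitrary:

import requests

# Hit Splash directly, bypassing scrapy-splash, to isolate where the hang occurs.
resp = requests.get(
    "http://localhost:8050/render.html",
    params={
        "url": "http://www.crazyplumbers.com/",
        "timeout": 30,  # Splash-side render timeout (must stay below max-timeout=90)
        "wait": 2,      # give the page time to settle before rendering
    },
    timeout=60,  # client-side timeout so this check itself can't hang forever
)
print(resp.status_code, len(resp.text))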

Minimal reproduction:

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy_splash import SplashRequest


class ResearchSpider(scrapy.Spider):
    name = "research_spider"

    custom_settings = {
        # Standard scrapy-splash setup, per the scrapy-splash README
        'SPLASH_URL': 'http://localhost:8050',
        'ROBOTSTXT_OBEY': True,
        'DOWNLOAD_DELAY': 2,
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36',
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy_splash.SplashCookiesMiddleware': 723,
            'scrapy_splash.SplashMiddleware': 725,
            'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
        },
        'SPIDER_MIDDLEWARES': {
            'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
        },
        'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
    }

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url,
                self.parse
            )

    def parse(self, response):
        print(f"parsing {response.url=}")


def crawl_process(websites: list[str]):
    print(f"Initializing crawler process - {websites=}")
    process = CrawlerProcess()
    process.crawl(ResearchSpider, start_urls=websites)
    process.start()
    print(f"Completed crawl")


if __name__ == "__main__":
    crawl_process([
        "http://www.crazyplumbers.com/",
    ])
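
For completeness, a variant of start_requests with explicit Splash timeouts, in case the hang is just a render that never finishes. This is a sketch, not a fix: timeout and resource_timeout are documented Splash render arguments, and the values below are arbitrary.

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url,
                self.parse,
                args={
                    'timeout': 30,           # abort the whole render after 30s (below max-timeout=90)
                    'resource_timeout': 10,  # drop any single resource request after 10s
                    'wait': 2,               # give the page time to settle before rendering
                },
            )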
