
PLAYWRIGHT_ABORT_REQUEST not working well when PLAYWRIGHT_BROWSER_TYPE as 'webkit'


Environment

  • Python 3.10
  • OS: macOS 14.1.1, Ubuntu 22.04 LTS
  • playwright version 1.42.0

When PLAYWRIGHT_BROWSER_TYPE is set to 'chromium' (or left at the default) under macOS, there appears to be a memory leak: memory usage keeps growing as the number of crawled pages increases. No such leak is observed under Linux.

When PLAYWRIGHT_BROWSER_TYPE is set to 'webkit' under macOS, the memory leak is gone, but the PLAYWRIGHT_ABORT_REQUEST callback fails to intercept most of the requests.

def should_abort_request(request):
    return (
        request.resource_type == "image"
        or ".jpg" in request.url
        or "ajax1" in request.url
        or "ajax2" in request.url
        or "ajax3" in request.url
    )
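
To check which requests actually reach the predicate under each browser type, a logging-instrumented variant of the callback can be used (a minimal debugging sketch; the logger name is arbitrary):

import logging

logger = logging.getLogger("abort-debug")

def should_abort_request(request):
    abort = (
        request.resource_type == "image"
        or ".jpg" in request.url
        or "ajax1" in request.url
        or "ajax2" in request.url
        or "ajax3" in request.url
    )
    # Every request that reaches the predicate is logged; requests that are
    # loaded by the page but never logged under webkit were not routed
    # through the callback at all.
    logger.info("abort=%s type=%s url=%s", abort, request.resource_type, request.url)
    return abort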

# Spider settings regarding playwright:

custom_settings = {
    'DOWNLOAD_HANDLERS': {
        "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    },
    'TWISTED_REACTOR': "twisted.internet.asyncioreactor.AsyncioSelectorReactor",

    'PLAYWRIGHT_BROWSER_TYPE': "webkit",
    'PLAYWRIGHT_ABORT_REQUEST': should_abort_request,
}

# The Request meta is set as:
meta={
    "playwright": True,
    "playwright_page_goto_kwargs": {"wait_until": "networkidle"},
},
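
For completeness, a minimal self-contained spider combining the settings, the abort predicate, and the request meta (the spider name and start URL below are placeholders) looks roughly like this:

import scrapy


def should_abort_request(request):
    return (
        request.resource_type == "image"
        or ".jpg" in request.url
        or "ajax1" in request.url
        or "ajax2" in request.url
        or "ajax3" in request.url
    )


class WebkitAbortSpider(scrapy.Spider):
    name = "webkit_abort_demo"  # placeholder name

    custom_settings = {
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "PLAYWRIGHT_BROWSER_TYPE": "webkit",
        "PLAYWRIGHT_ABORT_REQUEST": should_abort_request,
    }

    def start_requests(self):
        yield scrapy.Request(
            "https://example.com",  # placeholder URL
            meta={
                "playwright": True,
                "playwright_page_goto_kwargs": {"wait_until": "networkidle"},
            },
        )

    def parse(self, response):
        # Minimal parse callback, just to confirm the page was fetched.
        yield {"url": response.url, "length": len(response.text)}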
