scrapy-playwright
PLAYWRIGHT_ABORT_REQUEST not working well when PLAYWRIGHT_BROWSER_TYPE is 'webkit'
Environment
- Python 3.10
- OS: macOS 14.1.1, Ubuntu 22.04 LTS
- Playwright version 1.42.0
When PLAYWRIGHT_BROWSER_TYPE is set to 'chromium' (or left at the default) on macOS, there appears to be a memory leak as the number of crawled pages increases. No memory leak is observed on Linux.
When PLAYWRIGHT_BROWSER_TYPE is set to 'webkit' on macOS, the memory leak is gone, but the PLAYWRIGHT_ABORT_REQUEST callback fails to intercept most of the requests it should abort.
def should_abort_request(request):
    return (
        request.resource_type == "image"
        or ".jpg" in request.url
        or "ajax1" in request.url
        or "ajax2" in request.url
        or "ajax3" in request.url
    )
# Spider settings regarding playwright:
custom_settings = {
    'DOWNLOAD_HANDLERS': {
        "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    },
    'TWISTED_REACTOR': "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    'PLAYWRIGHT_BROWSER_TYPE': "webkit",
    'PLAYWRIGHT_ABORT_REQUEST': should_abort_request,
}
# The Request meta is set as:
meta={
    "playwright": True,
    "playwright_page_goto_kwargs": {"wait_until": "networkidle"},
},
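
To narrow this down, it may help to check whether WebKit never hands those requests to the predicate, or hands them over with a resource_type or URL that the rule does not match. Below is a minimal diagnostic sketch of the same predicate with per-type counters added. The names BLOCKED_URL_PARTS, seen and aborted are made up for illustration, and the sample objects are plain stand-ins exposing only resource_type and url, not real Playwright requests.

```python
from collections import Counter
from types import SimpleNamespace

# URL fragments copied from the predicate in this report (hypothetical constant name).
BLOCKED_URL_PARTS = (".jpg", "ajax1", "ajax2", "ajax3")

seen = Counter()     # requests handed to the predicate, keyed by resource type
aborted = Counter()  # requests the predicate asked to abort, keyed by resource type


def should_abort_request(request):
    """Same rule as in the report, plus counters for diagnosis."""
    abort = request.resource_type == "image" or any(
        part in request.url for part in BLOCKED_URL_PARTS
    )
    seen[request.resource_type] += 1
    if abort:
        aborted[request.resource_type] += 1
    return abort


# Stand-in requests (assumption: only the two attributes read above matter).
samples = [
    SimpleNamespace(resource_type="image", url="https://example.com/a.png"),
    SimpleNamespace(resource_type="xhr", url="https://example.com/ajax1?x=1"),
    SimpleNamespace(resource_type="document", url="https://example.com/"),
]
decisions = [should_abort_request(r) for r in samples]
```

If this predicate is used as PLAYWRIGHT_ABORT_REQUEST and the counters are logged at the end of a crawl, comparing the seen totals between 'chromium' and 'webkit' should show whether the requests reach the callback at all under WebKit.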