scrapy-playwright
Scrapy hangs with no exception raised
Configurations
Configuration 1:
- Model Name: MacBook Pro
- Chip: Apple M1 Max
- Cores: 10 (8 performance and 2 efficiency)
- Memory: 32GB
- System Version: macOS Sonoma 14.2.1
- Playwright Version: 1.42.0
- Python Version: 3.10.14
- Browsers: Chromium 123.0.6312.4, Firefox 123.0
Configuration 2:
- Chip: Intel Xeon
- Cores: 4 logical cores
- Memory: 8GB
- System Version: Ubuntu 22.04.2 LTS (GNU/Linux 5.15.0-76-generic x86_64)
- Docker Version: 25.0.4
- Playwright Version: 1.42.0
- Python Version: 3.10.13
- Browsers: Chromium 123.0.6312.4, Firefox 123.0
Description
I'm experiencing a problem where my Scrapy spider hangs indefinitely on a random URL during a crawl, without throwing any exceptions. The spider uses scrapy-playwright for page rendering. I have run my project on both of the configurations above and the same problem occurs on both.
Actual Behavior
Both Chromium and Firefox hang on scraping tasks, each time on different URLs.
2024-03-23 10:13:07 [scrapy.utils.log] INFO: Scrapy 2.11.1 started (bot: webspider)
2024-03-23 10:13:07 [scrapy.utils.log] INFO: Versions: lxml 5.1.0.0, libxml2 2.12.3, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.2, Twisted 24.3.0, Python 3.10.14 (main, Mar 19 2024, 21:46:16) [Clang 15.0.0 (clang-1500.3.9.4)], pyOpenSSL 24.0.0 (OpenSSL 3.2.1 30 Jan 2024), cryptography 42.0.5, Platform macOS-14.2.1-arm64-arm-64bit
2024-03-23 10:13:07 [scrapy.addons] INFO: Enabled addons:
[]
2024-03-23 10:13:07 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2024-03-23 10:13:07 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.unix_events._UnixSelectorEventLoop
2024-03-23 10:13:07 [scrapy.extensions.telnet] INFO: Telnet Password: 231864415aa317a2
2024-03-23 10:13:07 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2024-03-23 10:13:07 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'webspider',
'CONCURRENT_REQUESTS': 32,
'DNS_TIMEOUT': 6,
'DOWNLOAD_DELAY': 0.25,
'DOWNLOAD_TIMEOUT': 30,
'FEED_EXPORT_ENCODING': 'utf-8',
'NEWSPIDER_MODULE': 'webspider.spiders',
'REACTOR_THREADPOOL_MAXSIZE': 16,
'REDIRECT_MAX_TIMES': 5,
'REFERER_ENABLED': False,
'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
'RETRY_ENABLED': False,
'SPIDER_MODULES': ['webspider.spiders'],
'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2024-03-23 10:13:07 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'webspider.middlewares.RandomUserAgentMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-03-23 10:13:07 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-03-23 10:13:07 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2024-03-23 10:13:07 [scrapy.core.engine] INFO: Spider opened
2024-03-23 10:13:07 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-03-23 10:13:07 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-03-23 10:13:07 [scrapy-playwright] INFO: Starting download handler
2024-03-23 10:13:07 [scrapy-playwright] INFO: Starting download handler
2024-03-23 10:13:12 [scrapy-playwright] INFO: Launching browser chromium
2024-03-23 10:13:13 [scrapy-playwright] INFO: Browser chromium launched
2024-03-23 10:13:13 [scrapy-playwright] DEBUG: Browser context started: 'm-3baa5cd4-1614-4a6a-af55-5dbbd9682f29' (persistent=False, remote=False)
2024-03-23 10:13:13 [scrapy-playwright] DEBUG: [Context=m-3baa5cd4-1614-4a6a-af55-5dbbd9682f29] New page created, page count is 1 (1 for all contexts)
......
2024-03-23 10:14:40 [scrapy-playwright] DEBUG: [Context=m-3baa5cd4-1614-4a6a-af55-5dbbd9682f29] Response: <200 http://bdimg.share.baidu.com/static/api/js/view/view_base.js>
2024-03-23 10:14:40 [scrapy-playwright] DEBUG: [Context=m-3baa5cd4-1614-4a6a-af55-5dbbd9682f29] Response: <200 http://bdimg.share.baidu.com/static/api/js/share/api_base.js>
2024-03-23 10:14:40 [scrapy-playwright] DEBUG: [Context=m-3baa5cd4-1614-4a6a-af55-5dbbd9682f29] Request: <GET http://bdimg.share.baidu.com/static/api/css/share_style0_16.css?v=8105b07e.css> (resource type: stylesheet)
2024-03-23 10:14:40 [scrapy-playwright] DEBUG: [Context=m-3baa5cd4-1614-4a6a-af55-5dbbd9682f29] Response: <200 http://bdimg.share.baidu.com/static/api/css/share_style0_16.css?v=8105b07e.css>
2024-03-23 10:14:40 [scrapy-playwright] DEBUG: [Context=m-3baa5cd4-1614-4a6a-af55-5dbbd9682f29] Response: <200 https://nsclick.baidu.com/v.gif?pid=307&type=3071&sign=&desturl=https%253A%252F%252Fikikiuu.online%252F&linkid=lu3gj3suojw&apitype=0>
2024-03-23 10:14:41 [scrapy-playwright] DEBUG: [Context=m-3baa5cd4-1614-4a6a-af55-5dbbd9682f29] Response: <200 https://api.share.baidu.com/v.gif>
2024-03-23 10:15:07 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-03-23 10:16:07 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-03-23 10:17:07 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-03-23 10:18:07 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-03-23 10:19:07 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-03-23 10:20:07 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
Code
# SETTINGS
@classmethod
def update_settings(cls, settings: BaseSettings) -> None:
super().update_settings(settings)
# spider
settings.set("REFERER_ENABLED", False)
settings.set("TWISTED_REACTOR", "twisted.internet.asyncioreactor.AsyncioSelectorReactor")
settings.set("DOWNLOAD_HANDLERS", {
"http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
"https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
})
settings.set("SPIDER_MIDDLEWARES", {
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None,
})
settings.set("DOWNLOADER_MIDDLEWARES", {
"scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
"webspider.middlewares.RandomUserAgentMiddleware": 300,
})
settings.set("ROBOTSTXT_OBEY", False)
settings.set("RETRY_ENABLED", False)
settings.set("REDIRECT_MAX_TIMES", 5)
settings.set("DOWNLOAD_TIMEOUT", 30)
settings.set("DOWNLOAD_DELAY", 0.25)
# playwright
settings.set("PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT", "30000")
settings.set("PLAYWRIGHT_BROWSER_TYPE", "chromium")
settings.set("PLAYWRIGHT_MAX_CONTEXTS", 1)
settings.set("PLAYWRIGHT_MAX_PAGES_PER_CONTEXT", 1)
settings.set("PLAYWRIGHT_DEFAULT_HEADERS", {'Cache-Control': 'no-cache'})
if settings.get("PLAYWRIGHT_BROWSER_TYPE") == "firefox":
prefs = {
"security.mixed_content.block_active_content": False,
"webgl.disabled": True,
"services.sync.engine.addons": False,
"services.sync.engine.bookmarks": False,
"services.sync.engine.history": False,
"services.sync.engine.passwords": False,
"services.sync.engine.prefs": False,
"dom.webaudio.enabled": False,
"extensions.abuseReport.enabled": False,
"dom.webnotifications.enabled": False,
"browser.cache.disk.enable": False,
}
browser_args = [
"-private",
]
settings.set("PLAYWRIGHT_LAUNCH_OPTIONS", {
"headless": False,
"timeout": 30 * 1000,
"devtools": False,
"handle_sigint": False,
"handle_sigterm": False,
"handle_sighup": False,
"args": browser_args,
"firefox_user_prefs": prefs,
})
if settings.get("PLAYWRIGHT_BROWSER_TYPE") == "chromium":
browser_args = [
"--incognito",
"--disable-gpu"
"--disable-sync"
"--disable-apps"
"--disable-audio"
"--disable-plugins"
"--disable-infobars"
"--disable-extensions"
"--disable-translate"
"--disable-geolocation"
"--disable-notifications"
"--disable-winsta"
"--disable-dev-shm-usage"
"--disable-webgl"
"--disable-cache"
"--disable-popup-blocking"
"--disable-back-forward-cache"
"--arc-disable-gms-core-cache"
"--process-per-site"
"--disable-offline-load-stale-cache"
"--disk-cache-size=0"
"--no-sandbox"
"--disable-client-side-phishing-detection"
"--disable-breakpad"
"--ignore-certificate-errors",
"--ignore-urlfetcher-cert-requests"
"--disable-blink-features=AutomationControlled",
"--disable-web-security",
"--allow-running-insecure-content",
]
settings.set("PLAYWRIGHT_LAUNCH_OPTIONS", {
"headless": False,
"timeout": 30 * 1000,
"devtools": False,
"handle_sigint": False,
"handle_sigterm": False,
"handle_sighup": False,
"args": browser_args,
})
def start_requests(self) -> Iterable[Request]:
logging.info("start url: %s" % self.start_urls[0])
task_item = WebTaskItem(
task_id="m-" + str(uuid.uuid4()),
url=self.start_urls[0],
)
yield Request(
url=self.start_urls[0],
meta={
TASK_ITEM_META_KEY: task_item,
"spider": self.name,
RandomUserAgentMiddleware.use_mobile_ua_flag: task_item.is_mobile,
"playwright": True,
"playwright_include_page": True,
"playwright_context": task_item.task_id,
"playwright_context_kwargs": {
"ignore_https_errors": True,
"is_mobile": task_item.is_mobile,
"bypass_csp": True,
},
"playwright_page_init_callback": self.init_page,
"playwright_page_goto_kwargs": {
"wait_until": "load",
"timeout": 30000
},
"handle_httpstatus_all": True,
},
dont_filter=False,
callback=self.response_callback,
errback=self.failure_callback
)
async def response_callback(self, response: Response, **kwargs) -> WebTaskResultItem:
page: Page = response.meta.get("playwright_page", None)
if not page:
raise ValueError("PlaywrightSpider.response_callback: page is None")
try:
info = PageInfoItem()
for processor in self.processors:
# just fetch html / get title / take screenshots from page object
await processor.process(response=response, info=info)
result = WebTaskResultItem(
task_id=response.request.meta[TASK_ITEM_META_KEY].task_id,
page_info=info,
error_message=""
)
logging.info("result: %s" % result.model_dump_json())
yield result
except Exception as e:
logging.warning(f"PlaywrightSpider.response_callback: ex={e}")
result = WebTaskResultItem(
task_id=response.request.meta[TASK_ITEM_META_KEY].task_id,
error_message=str(e)
)
logging.info("result: %s" % result.model_dump_json())
finally:
await page.close()
await page.context.close()
async def failure_callback(self, failure, **kwargs) -> WebTaskResultItem:
page: Page = failure.request.meta.get("playwright_page", None)
if page:
try:
await page.close()
await page.context.close()
except Exception as e:
self.logger.warning(f"stop error:{e}")
result = WebTaskResultItem(
task_id=failure.request.meta[TASK_ITEM_META_KEY].task_id,
error_message=failure.getErrorMessage()
)
logging.info("result: %s" % result.model_dump_json())
yield result
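The init_page coroutine referenced through playwright_page_init_callback is not included above. For context only: a page init callback in scrapy-playwright receives the Playwright page and the Scrapy request as arguments. The body below is an illustrative stub, not the project's actual implementation.
# Illustrative stub of the page init callback wired in via
# "playwright_page_init_callback". The real init_page is not shown in this
# report; this only documents the expected (page, request) signature.
async def init_page(self, page, request):
    # Example: hide the webdriver flag before any page script runs.
    await page.add_init_script(
        "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
    )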
Troubleshooting Steps Taken
- Increased logging level to DEBUG, but no clues were found in the logs.
- Ensured that resources like Playwright pages and contexts are being closed correctly.
- Checked for potential deadlocks in my asynchronous code.
- Ran the spider under cProfile to profile the application, but no obvious bottlenecks were identified (the profiling setup is sketched below).
- Removed all Chromium args and Firefox user preferences; the results are the same.
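For reference, the profiling run looked roughly like this. This is a minimal sketch only: the spider name "playwright_spider" and the output file name are placeholders, and if the crawl hangs the stats file is simply never written.
# Minimal sketch of profiling a Scrapy crawl with cProfile.
import cProfile
import pstats

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def run_crawl() -> None:
    process = CrawlerProcess(get_project_settings())
    process.crawl("playwright_spider", start_urls=["https://example.org"])
    process.start()  # blocks until the crawl finishes (or hangs)

if __name__ == "__main__":
    cProfile.run("run_crawl()", "crawl.prof")
    pstats.Stats("crawl.prof").sort_stats("cumulative").print_stats(30)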
Debugging
I tried to use the Python debugger to step through the code and found the following:
- The requests were added to the task queue.
- The debugger shows that the request was processed by scrapy-playwright.
- The browser started correctly.
- The browser rendered the page correctly, and the goto function was called.
- Then the spider hangs for no apparent reason, ignoring every timeout setting.
- No more tasks can be processed; the spider only keeps printing the crawl stats.
- Neither the callback=self.response_callback nor the errback=self.failure_callback function was called.
- The main thread loops in the event selector (one way to inspect the pending tasks is sketched below).
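For anyone trying to repeat this inspection, below is a sketch of one way to watch for stuck coroutines: a helper that periodically logs all pending asyncio tasks. The hook point and the interval are arbitrary choices, and this is illustrative code rather than code from the project above.
# Illustrative helper: periodically log every pending asyncio task so a
# stuck coroutine can be spotted once the spider stops making progress.
import asyncio
import logging

async def dump_pending_tasks(interval: float = 60.0) -> None:
    while True:
        await asyncio.sleep(interval)
        tasks = [t for t in asyncio.all_tasks() if not t.done()]
        logging.info("pending asyncio tasks: %d", len(tasks))
        for task in tasks:
            # The last few frames usually show what the task is waiting on.
            logging.info("task=%r frames=%s", task.get_coro(), task.get_stack(limit=3))
Since scrapy-playwright requires the asyncio reactor, the helper can be scheduled from any coroutine that already runs on the loop, for example with asyncio.ensure_future(dump_pending_tasks()).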
Huge thanks for any ideas!
This report is not actionable as it is and should be distilled further; several of the things included are likely unrelated. Additionally, there are missing symbols and there is no actual URL to test (it is relevant whether or not the issue occurs in every case, in which case a generic URL like https://example.org could be used). Please see https://stackoverflow.com/help/minimal-reproducible-example.
I suspect the issue you're observing is probably related to the PLAYWRIGHT_MAX_CONTEXTS=1 and PLAYWRIGHT_MAX_PAGES_PER_CONTEXT=1 settings you have defined.
I think this problem is not related to PLAYWRIGHT_MAX_CONTEXTS and PLAYWRIGHT_MAX_PAGES_PER_CONTEXT. Their original values were 16 and 1; I deliberately set both to 1. A higher PLAYWRIGHT_MAX_CONTEXTS value only increases the time it takes for all contexts to hang. As for a URL to test, it's NSFW and I'm not sure whether posting it here is against the rules. To be clear, I am developing a content filter, so I need to collect some datasets. My temporary workaround is to create a process pool that runs Scrapy crawl tasks URL by URL and kills any crawl process that exceeds a timeout, which most likely means it has hung (sketched below).
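Roughly, the workaround looks like the sketch below: each URL is crawled in its own child process, and the process is killed if it exceeds a deadline. This is a simplified sketch; the spider name "playwright_spider" and the timeout value are placeholders.
# Simplified sketch of the per-URL workaround: run each crawl in a child
# process and kill it if it does not finish within a deadline.
import multiprocessing

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def crawl_one(url: str) -> None:
    process = CrawlerProcess(get_project_settings())
    process.crawl("playwright_spider", start_urls=[url])
    process.start()

def crawl_with_timeout(url: str, timeout: float = 120.0) -> None:
    child = multiprocessing.Process(target=crawl_one, args=(url,))
    child.start()
    child.join(timeout)
    if child.is_alive():
        # The crawl most likely hung; kill the whole child process.
        child.terminate()
        child.join()

if __name__ == "__main__":
    for url in ["https://example.org"]:
        crawl_with_timeout(url)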
Same problem here. I can't figure out why, because I can't reproduce it reliably.
Even when this occurs, sending a SIGINT signal does not terminate the crawler.
This is the traceback when it hangs:
>>> Signal received : entering python shell.
Traceback:
File "/usr/local/bin/scrapy", line 8, in <module>
sys.exit(execute())
File "/usr/local/lib/python3.10/site-packages/scrapy/cmdline.py", line 161, in execute
_run_print_help(parser, _run_command, cmd, args, opts)
File "/usr/local/lib/python3.10/site-packages/scrapy/cmdline.py", line 114, in _run_print_help
func(*a, **kw)
File "/usr/local/lib/python3.10/site-packages/scrapy/cmdline.py", line 169, in _run_command
cmd.run(args, opts)
File "/app/info_scraper/commands/crawl.py", line 24, in run
self.crawler_process.start()
File "/usr/local/lib/python3.10/site-packages/scrapy/crawler.py", line 429, in start
reactor.run(installSignalHandlers=install_signal_handlers) # blocking call
File "/usr/local/lib/python3.10/site-packages/twisted/internet/asyncioreactor.py", line 253, in run
self._asyncioEventloop.run_forever()
File "/usr/local/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/usr/local/lib/python3.10/asyncio/base_events.py", line 1871, in _run_once
event_list = self._selector.select(timeout)
File "/usr/local/lib/python3.10/selectors.py", line 469, in select
fd_event_list = self._selector.poll(timeout, max_ev)
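For reference, a stack dump like the one above can also be produced with nothing but the standard library by registering faulthandler on a spare signal before starting the crawl. This is an equivalent approach, not necessarily the mechanism that produced the output here.
# Dump the stack of every thread when the process receives SIGUSR1, so a
# hung crawler can be inspected with "kill -USR1 <pid>" (Unix only).
import faulthandler
import signal

faulthandler.register(signal.SIGUSR1, all_threads=True)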
I'm revisiting this issue and the available code is still both too long (e.g. including unrelated settings and classes) and incomplete (e.g. not containing a full working spider). Please provide a minimal, reproducible example; otherwise the issue will be closed.
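Something along the lines of the following bare-bones spider would be enough as a starting point, with https://example.org replaced by a URL that actually triggers the hang:
# Bare-bones scrapy-playwright spider, suitable as the skeleton of a
# minimal reproducible example.
import scrapy

class MinimalSpider(scrapy.Spider):
    name = "minimal"
    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
    }

    def start_requests(self):
        yield scrapy.Request(
            "https://example.org",
            meta={"playwright": True},
            callback=self.parse,
        )

    def parse(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}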
Closing due to inactivity.
@akkuman @Zhou-Haowei Has the problem been solved? I am facing the same issue where the crawler gets stuck, and to continue the process I have to press Cmd+C manually.