scrapy-playwright

ValueError: Page.title: The future belongs to a different loop than the one specified as the loop argument

Open ManHand1996 opened this issue 2 months ago • 6 comments

Example:

import scrapy

class MySpider(scrapy.Spider):
    name = "my_spider"
    start_urls = ['https://www.basketball-reference.com/leagues/NBA_2022.html']

    async def start(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, meta={'playwright': True, 'playwright_include_page': True})

    async def parse(self, response):
        page = response.meta['playwright_page']

        # Use Playwright's PageCoroutine to make sure the async call runs in the correct event loop
        title = await page.title()  # get the page title
        self.logger.info(f"Page Title: {title}")

        # Get the cookies
        cookies = await page.context.cookies()
        self.logger.info(f"Cookies: {cookies}")

        # Continue with the rest of the spider logic
        yield {'title': title, 'cookies': cookies}

Versions: scrapy 2.13.3, scrapy-playwright 0.0.44, playwright 1.55.0

I just want to get the cookies through the Playwright Page object, but it doesn't work. It seems that Scrapy's async handling conflicts with Playwright. Please help, thanks.

ManHand1996 avatar Sep 27 '25 12:09 ManHand1996

I'm sorry, I cannot reproduce with the following software versions:

$ scrapy version -v
Scrapy       : 2.13.3
lxml         : 6.0.0
libxml2      : 2.14.4
cssselect    : 1.3.0
parsel       : 1.10.0
w3lib        : 2.3.1
Twisted      : 25.5.0
Python       : 3.12.3 (main, Jun 10 2024, 14:59:09) [GCC 11.4.0]
pyOpenSSL    : 25.1.0 (OpenSSL 3.5.2 5 Aug 2025)
cryptography : 45.0.6
Platform     : Linux-6.5.0-45-generic-x86_64-with-glibc2.35

$ python -c "import scrapy; print(scrapy.__version__)" 
2.13.3

$ playwright --version
Version 1.55.0

Please provide the software versions you are using and additional logs.

Are you maybe using Windows? I usually rely on the Windows CI as I don't have quick access to a Windows system to develop directly on it. The main difference on Windows is the use of a separate threaded loop implementation, and the issue title could point in that direction. However I forced the threaded loop in this example (by setting _PLAYWRIGHT_THREADED_LOOP=True, a private undocumented setting intended only for tests) and the crawl finished successfully.
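
For context, the message in the issue title is the one asyncio raises when a future created on one event loop is handed to a different loop. The following is a minimal sketch in plain asyncio, not taken from scrapy-playwright or Playwright; the function name different_loop_error is hypothetical and only illustrates where the wording comes from:

```python
import asyncio


def different_loop_error() -> str:
    """Return the message asyncio raises when a future crosses event loops."""
    loop_a = asyncio.new_event_loop()
    loop_b = asyncio.new_event_loop()
    try:
        # The future is bound to loop_a at creation time.
        fut = loop_a.create_future()
        try:
            # Asking loop_b to adopt a future owned by loop_a is rejected.
            asyncio.ensure_future(fut, loop=loop_b)
        except ValueError as exc:
            return str(exc)
        return ""
    finally:
        loop_a.close()
        loop_b.close()


if __name__ == "__main__":
    print(different_loop_error())
```

If the same kind of mismatch happens between the reactor's loop and the loop that owns the Playwright objects, awaiting a Page method from a callback could plausibly surface this error, which is why the threaded loop is a suspect here.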


Logs excerpt:

(...)
2025-09-29 11:16:14 [scrapy.core.engine] INFO: Closing spider (finished)
2025-09-29 11:16:14 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 250,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 1087980,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 7.210502,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2025, 9, 29, 14, 16, 14, 629719, tzinfo=datetime.timezone.utc),
 'item_scraped_count': 1,
 'items_per_minute': 8.571428571428571,
 'log_count/DEBUG': 794,
 'log_count/INFO': 15,
 'log_count/WARNING': 1,
 'memusage/max': 75010048,
 'memusage/startup': 75010048,
 'playwright/browser_count': 1,
 'playwright/context_count': 1,
 'playwright/context_count/max_concurrent': 1,
 'playwright/context_count/persistent/False': 1,
 'playwright/context_count/remote/False': 1,
 'playwright/page_count': 1,
 'playwright/page_count/max_concurrent': 1,
 'playwright/request_count': 397,
 'playwright/request_count/method/GET': 369,
 'playwright/request_count/method/HEAD': 1,
 'playwright/request_count/method/POST': 27,
 'playwright/request_count/navigation': 81,
 'playwright/request_count/resource_type/document': 81,
 'playwright/request_count/resource_type/fetch': 55,
 'playwright/request_count/resource_type/image': 200,
 'playwright/request_count/resource_type/other': 4,
 'playwright/request_count/resource_type/script': 36,
 'playwright/request_count/resource_type/stylesheet': 2,
 'playwright/request_count/resource_type/xhr': 19,
 'playwright/response_count': 390,
 'playwright/response_count/method/GET': 363,
 'playwright/response_count/method/HEAD': 1,
 'playwright/response_count/method/POST': 26,
 'playwright/response_count/resource_type/document': 81,
 'playwright/response_count/resource_type/fetch': 54,
 'playwright/response_count/resource_type/image': 195,
 'playwright/response_count/resource_type/other': 4,
 'playwright/response_count/resource_type/script': 35,
 'playwright/response_count/resource_type/stylesheet': 2,
 'playwright/response_count/resource_type/xhr': 19,
 'response_received_count': 1,
 'responses_per_minute': 8.571428571428571,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2025, 9, 29, 14, 16, 7, 419217, tzinfo=datetime.timezone.utc)}
2025-09-29 11:16:14 [scrapy.core.engine] INFO: Spider closed (finished)
2025-09-29 11:16:14 [scrapy-playwright] INFO: Closing download handler
2025-09-29 11:16:14 [scrapy-playwright] DEBUG: Browser context closed: 'default' (persistent=False, remote=False)
2025-09-29 11:16:14 [scrapy-playwright] INFO: Closing browser
2025-09-29 11:16:14 [scrapy-playwright] DEBUG: Browser disconnected

elacuesta avatar Sep 29 '25 14:09 elacuesta

❯  scrapy version -v
Scrapy       : 2.13.3
lxml         : 6.0.2
libxml2      : 2.11.9
cssselect    : 1.3.0
parsel       : 1.10.0
w3lib        : 2.3.1
Twisted      : 25.5.0
Python       : 3.10.7 (tags/v3.10.7:6cc6b13, Sep  5 2022, 14:08:36) [MSC v.1933 64 bit (AMD64)]
pyOpenSSL    : 25.3.0 (OpenSSL 3.5.3 16 Sep 2025)
cryptography : 46.0.1
Platform     : Windows-10-10.0.26100-SP0

❯ playwright --version
Version 1.55.0

Setting _PLAYWRIGHT_THREADED_LOOP=True doesn't work. I also found that the Scrapy setting TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor" doesn't support ProactorEventLoop, but this plugin creates Playwright using a ProactorEventLoop. So I tried to change the event loop:

# settings.py
ASYNCIO_EVENT_LOOP = "asyncio.windows_events.SelectorEventLoop"
# scrapy-playwright/_utils.py
class _ThreadedLoopAdapter:
    ......

    def start(cls, caller_id: int) -> None:
        cls._stop_events[caller_id] = asyncio.Event()
        if not getattr(cls, "_loop", None):
            policy = asyncio.DefaultEventLoopPolicy()
            if platform.system() == "Windows":
                # policy = asyncio.WindowsProactorEventLoopPolicy()  # type: ignore[attr-defined]
                policy = asyncio.WindowsSelectorEventLoopPolicy()
            cls._loop = policy.new_event_loop()

But this produces the error below. Is there any other event loop that would make it work, even with lower performance? Thanks.

2025-10-19 00:15:17 [scrapy.utils.signal] ERROR: Error caught on signal handler: <bound method DownloadHandlers._close of <scrapy.core.downloader.handlers.DownloadHandlers object at 0x000001F3B6157010>>
Traceback (most recent call last):
  File "D:\Projects\part_time_project\.env\lib\site-packages\twisted\internet\defer.py", line 1853, in _inlineCallbacks
    result = context.run(
  File "D:\Projects\part_time_project\.env\lib\site-packages\twisted\python\failure.py", line 467, in throwExceptionIntoGenerator
    return g.throw(self.value.with_traceback(self.tb))
  File "D:\Projects\part_time_project\.env\lib\site-packages\scrapy\core\downloader\handlers\__init__.py", line 109, in _close
    yield dh.close()
  File "D:\Projects\part_time_project\.env\lib\site-packages\twisted\internet\defer.py", line 1853, in _inlineCallbacks
    result = context.run(
  File "D:\Projects\part_time_project\.env\lib\site-packages\twisted\python\failure.py", line 467, in throwExceptionIntoGenerator
    return g.throw(self.value.with_traceback(self.tb))
  File "D:\Projects\part_time_project\.env\lib\site-packages\scrapy_playwright\handler.py", line 355, in close
    yield self._deferred_from_coro(self._close())
  File "D:\Projects\part_time_project\.env\lib\site-packages\scrapy_playwright\_utils.py", line 123, in _handle_coro
    result = await coro
  File "D:\Projects\part_time_project\.env\lib\site-packages\scrapy_playwright\handler.py", line 367, in _close
    await self.playwright_context_manager.__aexit__()
  File "D:\Projects\part_time_project\.env\lib\site-packages\playwright\async_api\_context_manager.py", line 57, in __aexit__
    await self._connection.stop_async()
  File "D:\Projects\part_time_project\.env\lib\site-packages\playwright\_impl\_connection.py", line 321, in stop_async
    self._transport.request_stop()
  File "D:\Projects\part_time_project\.env\lib\site-packages\playwright\_impl\_transport.py", line 97, in request_stop
    assert self._output
AttributeError: 'PipeTransport' object has no attribute '_output'

ManHand1996 avatar Oct 18 '25 16:10 ManHand1996

There's no need to change the event loop; this is handled automatically by scrapy-playwright. See the notes about Windows support in the readme: https://github.com/scrapy-plugins/scrapy-playwright/tree/v0.0.44#windows-support
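
For reference, a sketch of what an unmodified configuration might look like, assuming a standard project settings.py. The two settings below are the ones the scrapy-playwright readme asks for; note that there is no ASYNCIO_EVENT_LOOP override and no patch to _utils.py:

```python
# settings.py (sketch): only the settings scrapy-playwright itself requires.

# Route HTTP and HTTPS requests through the Playwright download handler.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# Scrapy must run on the asyncio-based Twisted reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```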

There is also no need to use _PLAYWRIGHT_THREADED_LOOP if you're already on Windows. As mentioned in the docs I linked above, Windows is supported by running the Playwright process in a separate thread. What _PLAYWRIGHT_THREADED_LOOP does is force that approach even if it's not strictly necessary, for testing purposes.
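
To illustrate the general idea of a separate threaded loop, here is a minimal sketch in plain asyncio. It is not scrapy-playwright's actual implementation: the class ThreadedLoop, the coroutine which_thread, and the thread name are all hypothetical, and the real adapter may differ in its details.

```python
import asyncio
import threading


class ThreadedLoop:
    """Hypothetical helper: owns an event loop that runs in its own thread."""

    def __init__(self) -> None:
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(
            target=self._loop.run_forever,
            name="hypothetical-worker-loop",
            daemon=True,
        )
        self._thread.start()

    def submit(self, coro):
        # Schedule the coroutine on the worker loop from any other thread.
        # Returns a concurrent.futures.Future, which is not tied to a loop.
        return asyncio.run_coroutine_threadsafe(coro, self._loop)

    def stop(self) -> None:
        # Stop the worker loop from outside its thread, then clean up.
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()


async def which_thread() -> str:
    # Stand-in for work that must run on the worker loop.
    await asyncio.sleep(0)
    return threading.current_thread().name


async def main(worker: ThreadedLoop) -> str:
    # Bridge the result back into the caller's loop instead of awaiting
    # an object that belongs to the worker loop directly.
    return await asyncio.wrap_future(worker.submit(which_thread()))


if __name__ == "__main__":
    worker = ThreadedLoop()
    try:
        print(asyncio.run(main(worker)))
    finally:
        worker.stop()
```

The point of the sketch is that results cross between the two loops through a thread-safe future, so neither loop ever awaits a future owned by the other.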

Are you sure you're using scrapy-playwright version 0.0.44? I didn't realize before that this looks exactly like #307, which was solved only in v0.0.44.

$ python -c "import scrapy_playwright; print(scrapy_playwright.__version__)"
0.0.44

elacuesta avatar Oct 23 '25 01:10 elacuesta

Got the same issue on Windows:

(.venv) PS C:\code\Company\autohome_spider> scrapy version -v
Scrapy       : 2.13.3
lxml         : 6.0.2
libxml2      : 2.11.9
cssselect    : 1.3.0
parsel       : 1.10.0
w3lib        : 2.3.1
Twisted      : 25.5.0
Python       : 3.10.11 (tags/v3.10.11:7d4cc5a, Apr 5 2023, 00:38:17) [MSC v.1929 64 bit (AMD64)]
pyOpenSSL    : 25.3.0 (OpenSSL 3.5.4 30 Sep 2025)
cryptography : 46.0.3
Platform     : Windows-10-10.0.19045-SP0

(.venv) PS C:\code\Company\autohome_spider> python -c "import scrapy; print(scrapy.__version__)"
2.13.3

(.venv) PS C:\code\Company\autohome_spider> playwright --version
Version 1.55.0

CAH-FlyChen avatar Nov 05 '25 07:11 CAH-FlyChen

I still get the same error as in https://github.com/scrapy-plugins/scrapy-playwright/issues/307. Maybe the problem is related to the operating system.

CAH-FlyChen avatar Nov 05 '25 07:11 CAH-FlyChen

With WSL everything works. Only Windows produces the error.

CAH-FlyChen avatar Nov 05 '25 09:11 CAH-FlyChen