scrapy-playwright
inspect_response not working in spider with scrapy_playwright
Hi all,
I have a simple example below which should work but doesn't.
import scrapy
from scrapy.shell import inspect_response


class AwesomeSpider(scrapy.Spider):
    name = "test-playwright"

    def start_requests(self):
        # Route the request through Playwright instead of the default downloader.
        yield scrapy.Request("https://quotes.toscrape.com/", meta={
            "playwright": True,
        })

    def parse(self, response):
        # Open an interactive shell on the downloaded response.
        inspect_response(response, self)
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
It gives the following errors in the shell:
..........
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
2022-08-15 11:17:16 [scrapy.core.scraper] ERROR: Spider error processing <GET https://quotes.toscrape.com/> (referer: https://fonts.googleapis.com/)
Traceback (most recent call last):
  File "/python/scrapy-projects/rightmove/venv/lib/python3.9/site-packages/twisted/internet/defer.py", line 1030, in adapt
    extracted = result.result()
2022-08-15 11:17:16 [py.warnings] WARNING: /python/scrapy-projects/rightmove/venv/lib/python3.9/site-packages/IPython/core/displayhook.py:311: RuntimeWarning: coroutine 'Application.run_async' was never awaited
  gc.collect()
However, everything works fine if I launch the Scrapy shell directly from the command line, like so: scrapy shell 'https://quotes.toscrape.com'
Any ideas? I'm stumped, I think it's something to do with asyncio. Thanks,
(edited for syntax highlighting)
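One plausible reading of the "coroutine ... was never awaited" warning, not confirmed anywhere in this thread, is that the interactive shell tries to start its own event loop while the asyncio reactor's loop is already running. The sketch below only illustrates that general asyncio behavior with the standard library; it does not involve Scrapy or IPython, and `nested_run` is a hypothetical name.

```python
import asyncio


async def nested_run() -> str:
    # We are already inside a running event loop here.
    inner = asyncio.sleep(0)  # coroutine object that is created but not yet awaited
    try:
        # Starting a second loop from inside a running one is rejected by asyncio.
        asyncio.run(inner)
    except RuntimeError as exc:
        # Close the coroutine explicitly; otherwise Python would emit a
        # "coroutine ... was never awaited" RuntimeWarning when it is collected.
        inner.close()
        return str(exc)
    return "no error"


message = asyncio.run(nested_run())
print(message)
```

If the same thing happens inside the shell startup, the coroutine that was meant to drive the prompt is left unawaited, which would match the warning in the log above.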
The traceback is not exactly the same, but the situation looks very similar to https://github.com/scrapy/scrapy/issues/5447. I see mentions of IPython in your post, so I'd recommend trying the regular interpreter by setting the SCRAPY_PYTHON_SHELL=python environment variable.
I'm getting a working shell with the following:
import scrapy


class AwesomeSpider(scrapy.Spider):
    name = "test-playwright"
    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
    }

    def start_requests(self):
        yield scrapy.Request(
            url="https://quotes.toscrape.com/",
            meta={"playwright": True},
        )

    def parse(self, response):
        scrapy.shell.inspect_response(response, self)
$ SCRAPY_PYTHON_SHELL=python scrapy runspider shell.py
(...)
2023-08-25 19:57:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://quotes.toscrape.com/> (referer: None) ['playwright']
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7f561431b340>
[s]   item       {}
[s]   request    <GET https://quotes.toscrape.com/>
[s]   response   <200 https://quotes.toscrape.com/>
[s]   settings   <scrapy.settings.Settings object at 0x7f561431b790>
[s]   spider     <AwesomeSpider 'test-playwright' at 0x7f5613f0fac0>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
>>> len(response.css('div.quote'))
10
>>>
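If exporting the variable on every run is inconvenient, the same choice can probably be made from inside the spider module. This is a minimal sketch, assuming Scrapy reads SCRAPY_PYTHON_SHELL from the process environment at the moment the shell is started; `force_plain_python_shell` is a hypothetical helper, not part of Scrapy or scrapy-playwright.

```python
import os


def force_plain_python_shell() -> str:
    """Select the standard interpreter for Scrapy's interactive shell.

    Hypothetical helper for illustration only.
    """
    # Intended to mirror launching with SCRAPY_PYTHON_SHELL=python,
    # but scoped to the current process.
    os.environ["SCRAPY_PYTHON_SHELL"] = "python"
    return os.environ["SCRAPY_PYTHON_SHELL"]


selected_shell = force_plain_python_shell()
print(selected_shell)
```

In the spider above, the call would need to run before scrapy.shell.inspect_response(response, self), for example at module level or at the top of parse.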
I'm closing this issue, as it is caused by the upstream issue https://github.com/scrapy/scrapy/issues/5447.