scrapy-playwright
ERR_INVALID_ARGUMENT on any url
Hello,
I have installed scrapy-playwright in my venv with pip install scrapy-playwright, and after that ran playwright install.
Whatever I try, I get the ERR_INVALID_ARGUMENT error for any URL, even with something basic like:
import scrapy
from project.settings import DOWNLOADER_MIDDLEWARES


class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    CUSTOM_DOWNLOADER_MIDDLEWARES = DOWNLOADER_MIDDLEWARES.copy()
    CUSTOM_DOWNLOADER_MIDDLEWARES.update({
        'project.middlewares.BrightDataProxyMiddleware': 900,
        'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 400,
        'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': None,
        'tor_ip_rotator.middlewares.TorProxyMiddleware': None,
    })

    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        'DOWNLOADER_MIDDLEWARES': CUSTOM_DOWNLOADER_MIDDLEWARES,
        'USER_AGENT': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:96.0) Gecko/20100101 Firefox/96.0',
    }

    def start_requests(self):
        url = "https://quotes.toscrape.com/js/"
        yield scrapy.Request(url, meta={'playwright': True})

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
It ends with:
2022-09-07 16:26:48,375 scrapy.extensions.telnet - INFO:Telnet console listening on 127.0.0.1:6053
2022-09-07 16:26:48,376 scrapy-playwright - INFO:Starting download handler
2022-09-07 16:26:53,376 scrapy-playwright - INFO:Launching browser chromium
2022-09-07 16:26:53,742 scrapy-playwright - INFO:Browser chromium launched
2022-09-07 16:26:53,751 scrapy-playwright - DEBUG:Browser context started: 'default' (persistent=False)
2022-09-07 16:26:53,998 scrapy-playwright - DEBUG:[Context=default] New page created, page count is 1 (1 for all contexts)
2022-09-07 16:26:54,020 scrapy-playwright - DEBUG:[Context=default] Request: <GET https://quotes.toscrape.com/js/> (resource type: document, referrer: None)
2022-09-07 16:26:54,027 scrapy-playwright - WARNING:Closing page due to failed request: <GET https://quotes.toscrape.com/js/> (<class 'playwright._impl._api_types.Error'>)
2022-09-07 16:26:54,139 scrapy.core.scraper - ERROR:Error downloading <GET https://quotes.toscrape.com/js/>
Traceback (most recent call last):
File "/Users/dragospopa/Desktop/project/scrapingvenv/lib/python3.9/site-packages/twisted/internet/defer.py", line 1656, in _inlineCallbacks
result = current_context.run(
File "/Users/dragospopa/Desktop/project/scrapingvenv/lib/python3.9/site-packages/twisted/python/failure.py", line 489, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/Users/dragospopa/Desktop/project/scrapingvenv/lib/python3.9/site-packages/scrapy/core/downloader/middleware.py", line 49, in process_request
return (yield download_func(request=request, spider=spider))
File "/Users/dragospopa/Desktop/project/scrapingvenv/lib/python3.9/site-packages/twisted/internet/defer.py", line 1030, in adapt
extracted = result.result()
File "/Users/dragospopa/Desktop/project/scrapingvenv/lib/python3.9/site-packages/scrapy_playwright/handler.py", line 275, in _download_request
result = await self._download_request_with_page(request, page)
File "/Users/dragospopa/Desktop/project/scrapingvenv/lib/python3.9/site-packages/scrapy_playwright/handler.py", line 293, in _download_request_with_page
response = await page.goto(url=request.url, **page_goto_kwargs)
File "/Users/dragospopa/Desktop/project/scrapingvenv/lib/python3.9/site-packages/playwright/async_api/_generated.py", line 7413, in goto
await self._impl_obj.goto(
File "/Users/dragospopa/Desktop/project/scrapingvenv/lib/python3.9/site-packages/playwright/_impl/_page.py", line 496, in goto
return await self._main_frame.goto(**locals_to_params(locals()))
File "/Users/dragospopa/Desktop/project/scrapingvenv/lib/python3.9/site-packages/playwright/_impl/_frame.py", line 136, in goto
await self._channel.send("goto", locals_to_params(locals()))
File "/Users/dragospopa/Desktop/project/scrapingvenv/lib/python3.9/site-packages/playwright/_impl/_connection.py", line 43, in send
return await self._connection.wrap_api_call(
File "/Users/dragospopa/Desktop/project/scrapingvenv/lib/python3.9/site-packages/playwright/_impl/_connection.py", line 387, in wrap_api_call
return await cb()
File "/Users/dragospopa/Desktop/project/scrapingvenv/lib/python3.9/site-packages/playwright/_impl/_connection.py", line 78, in inner_send
result = next(iter(done)).result()
playwright._impl._api_types.Error: net::ERR_INVALID_ARGUMENT at https://quotes.toscrape.com/js/
=========================== logs ===========================
navigating to "https://quotes.toscrape.com/js/", waiting until "load"
============================================================
2022-09-07 16:26:54,248 scrapy.core.engine - INFO:Closing spider (finished)
How can I fix this?
(Edited for syntax highlighting)
I suspect project.middlewares.BrightDataProxyMiddleware might be trying to set proxy configuration via Request.meta["proxy"], which isn't supported (see also Known issues). It works well for me without specifying DOWNLOADER_MIDDLEWARES.
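For reference, a proxy can instead be configured at the browser level through Playwright's launch options. A minimal sketch, assuming a browser-wide proxy is acceptable (the server address and credentials below are placeholders):

# In settings.py or the spider's custom_settings (placeholder proxy values):
PLAYWRIGHT_LAUNCH_OPTIONS = {
    "proxy": {
        "server": "http://myproxy.example.com:3128",  # placeholder proxy server
        "username": "user",  # placeholder credentials
        "password": "pass",
    },
}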
That BrightDataProxyMiddleware only gets activated if I set BRIGHTDATA_ENABLED: True in the custom settings; otherwise it doesn't do anything.
Also, just to be sure, I changed the settings to those below and I still get the same error:
CUSTOM_DOWNLOADER_MIDDLEWARES = {}
CUSTOM_DOWNLOADER_MIDDLEWARES.update({
    'project.middlewares.BrightDataProxyMiddleware': None,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 400,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': None,
    'tor_ip_rotator.middlewares.TorProxyMiddleware': None,
})

custom_settings = {
    "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    "DOWNLOAD_HANDLERS": {
        "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    },
    'DOWNLOADER_MIDDLEWARES': CUSTOM_DOWNLOADER_MIDDLEWARES,
    'USER_AGENT': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:96.0) Gecko/20100101 Firefox/96.0',
}
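One way to double-check that nothing injects Request.meta["proxy"] at runtime is a small logging middleware placed right before the download handler. This is a hypothetical debugging helper (project.middlewares is just where it could live), not part of scrapy-playwright:

# project/middlewares.py (hypothetical debugging helper)
import logging

logger = logging.getLogger(__name__)


class MetaLoggerMiddleware:
    # Log each request's meta just before it reaches the download handler,
    # so any injected "proxy" key shows up in the crawl log.
    def process_request(self, request, spider):
        logger.info("meta for %s: %r", request.url, request.meta)
        return None

Enabling it with a high order value (e.g. 'project.middlewares.MetaLoggerMiddleware': 990) makes its process_request run last, closest to the download handler.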
Getting the same error, any fixes?
I still could not reproduce.
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "DOWNLOADER_MIDDLEWARES": {
            "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": 400,
        },
        "USER_AGENT": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:96.0) Gecko/20100101 Firefox/96.0",
    }

    def start_requests(self):
        url = "https://quotes.toscrape.com/js/"
        yield scrapy.Request(url, meta={"playwright": True})

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }
(Note that I'm not including any of the non-built-in middlewares; disabling them with None is the same as not adding them in the first place.)
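Put differently, these two settings produce the same effective chain (the dotted path below is only an illustrative placeholder), and the result can be checked against the "Enabled downloader middlewares" list that Scrapy logs at startup:

# Listed but disabled with None ...
DOWNLOADER_MIDDLEWARES = {"project.middlewares.SomeMiddleware": None}
# ... is equivalent to not listing it at all:
DOWNLOADER_MIDDLEWARES = {}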
$ scrapy crawl quotes -o quotes.json
(...)
2022-09-27 14:47:38 [scrapy.core.engine] INFO: Spider opened
2022-09-27 14:47:38 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-09-27 14:47:38 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-09-27 14:47:38 [scrapy-playwright] INFO: Starting download handler
2022-09-27 14:47:44 [scrapy-playwright] INFO: Launching browser chromium
2022-09-27 14:47:44 [scrapy-playwright] INFO: Browser chromium launched
2022-09-27 14:47:46 [scrapy.core.engine] INFO: Closing spider (finished)
2022-09-27 14:47:46 [scrapy.extensions.feedexport] INFO: Stored json feed (10 items) in: quotes.json
(...)
2022-09-27 14:47:46 [scrapy.core.engine] INFO: Spider closed (finished)
2022-09-27 14:47:46 [scrapy-playwright] INFO: Closing download handler
2022-09-27 14:47:47 [scrapy-playwright] INFO: Closing browser
$ scrapy version -v
Scrapy : 2.6.0
lxml : 4.8.0.0
libxml2 : 2.9.12
cssselect : 1.1.0
parsel : 1.6.0
w3lib : 1.22.0
Twisted : 22.2.0
Python : 3.9.6 (default, Sep 6 2021, 10:09:19) - [GCC 7.5.0]
pyOpenSSL : 22.0.0 (OpenSSL 1.1.1m 14 Dec 2021)
cryptography : 36.0.1
Platform : Linux-5.4.0-125-generic-x86_64-with-glibc2.31
$ python -c "import scrapy_playwright; print(scrapy_playwright.__version__)"
0.0.21