
ERR_INVALID_ARGUMENT on any url

[Open] dragospopa420 opened this issue · 2 comments

Hello, I installed scrapy-playwright in my venv with pip install scrapy-playwright and then ran playwright install. Whatever I try, I get the ERR_INVALID_ARGUMENT error for any URL, even for something basic like:

import scrapy
from project.settings import DOWNLOADER_MIDDLEWARES

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    CUSTOM_DOWNLOADER_MIDDLEWARES = DOWNLOADER_MIDDLEWARES.copy()
    CUSTOM_DOWNLOADER_MIDDLEWARES.update({
        'project.middlewares.BrightDataProxyMiddleware': 900,
        'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 400,
        'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': None,
        'tor_ip_rotator.middlewares.TorProxyMiddleware': None,
    })

    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        'DOWNLOADER_MIDDLEWARES': CUSTOM_DOWNLOADER_MIDDLEWARES,
        'USER_AGENT': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:96.0) Gecko/20100101 Firefox/96.0',

    }

    def start_requests(self):
        url = "https://quotes.toscrape.com/js/"
        yield scrapy.Request(url, meta={'playwright': True})

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall()
            }

It ends with:

2022-09-07 16:26:48,375 scrapy.extensions.telnet - INFO:Telnet console listening on 127.0.0.1:6053
2022-09-07 16:26:48,376 scrapy-playwright - INFO:Starting download handler
2022-09-07 16:26:53,376 scrapy-playwright - INFO:Launching browser chromium
2022-09-07 16:26:53,742 scrapy-playwright - INFO:Browser chromium launched
2022-09-07 16:26:53,751 scrapy-playwright - DEBUG:Browser context started: 'default' (persistent=False)
2022-09-07 16:26:53,998 scrapy-playwright - DEBUG:[Context=default] New page created, page count is 1 (1 for all contexts)
2022-09-07 16:26:54,020 scrapy-playwright - DEBUG:[Context=default] Request: <GET https://quotes.toscrape.com/js/> (resource type: document, referrer: None)
2022-09-07 16:26:54,027 scrapy-playwright - WARNING:Closing page due to failed request: <GET https://quotes.toscrape.com/js/> (<class 'playwright._impl._api_types.Error'>)
2022-09-07 16:26:54,139 scrapy.core.scraper - ERROR:Error downloading <GET https://quotes.toscrape.com/js/>
Traceback (most recent call last):
  File "/Users/dragospopa/Desktop/project/scrapingvenv/lib/python3.9/site-packages/twisted/internet/defer.py", line 1656, in _inlineCallbacks
    result = current_context.run(
  File "/Users/dragospopa/Desktop/project/scrapingvenv/lib/python3.9/site-packages/twisted/python/failure.py", line 489, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/Users/dragospopa/Desktop/project/scrapingvenv/lib/python3.9/site-packages/scrapy/core/downloader/middleware.py", line 49, in process_request
    return (yield download_func(request=request, spider=spider))
  File "/Users/dragospopa/Desktop/project/scrapingvenv/lib/python3.9/site-packages/twisted/internet/defer.py", line 1030, in adapt
    extracted = result.result()
  File "/Users/dragospopa/Desktop/project/scrapingvenv/lib/python3.9/site-packages/scrapy_playwright/handler.py", line 275, in _download_request
    result = await self._download_request_with_page(request, page)
  File "/Users/dragospopa/Desktop/project/scrapingvenv/lib/python3.9/site-packages/scrapy_playwright/handler.py", line 293, in _download_request_with_page
    response = await page.goto(url=request.url, **page_goto_kwargs)
  File "/Users/dragospopa/Desktop/project/scrapingvenv/lib/python3.9/site-packages/playwright/async_api/_generated.py", line 7413, in goto
    await self._impl_obj.goto(
  File "/Users/dragospopa/Desktop/project/scrapingvenv/lib/python3.9/site-packages/playwright/_impl/_page.py", line 496, in goto
    return await self._main_frame.goto(**locals_to_params(locals()))
  File "/Users/dragospopa/Desktop/project/scrapingvenv/lib/python3.9/site-packages/playwright/_impl/_frame.py", line 136, in goto
    await self._channel.send("goto", locals_to_params(locals()))
  File "/Users/dragospopa/Desktop/project/scrapingvenv/lib/python3.9/site-packages/playwright/_impl/_connection.py", line 43, in send
    return await self._connection.wrap_api_call(
  File "/Users/dragospopa/Desktop/project/scrapingvenv/lib/python3.9/site-packages/playwright/_impl/_connection.py", line 387, in wrap_api_call
    return await cb()
  File "/Users/dragospopa/Desktop/project/scrapingvenv/lib/python3.9/site-packages/playwright/_impl/_connection.py", line 78, in inner_send
    result = next(iter(done)).result()
playwright._impl._api_types.Error: net::ERR_INVALID_ARGUMENT at https://quotes.toscrape.com/js/
=========================== logs ===========================
navigating to "https://quotes.toscrape.com/js/", waiting until "load"
============================================================
2022-09-07 16:26:54,248 scrapy.core.engine - INFO:Closing spider (finished)

How can I fix this?

(Edited for syntax highlighting)

— dragospopa420, Sep 07 '22

I suspect project.middlewares.BrightDataProxyMiddleware might be trying to set proxy configuration via Request.meta["proxy"], which isn't supported (see also Known issues). It works well for me without specifying DOWNLOADER_MIDDLEWARES.
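In case it helps, scrapy-playwright takes proxy configuration at the browser level instead, via PLAYWRIGHT_LAUNCH_OPTIONS. Something along these lines (the server and credentials below are placeholders, not real endpoints):

```python
# settings.py (sketch): pass the proxy to the browser at launch time
# instead of setting Request.meta["proxy"] per request.
PLAYWRIGHT_LAUNCH_OPTIONS = {
    "proxy": {
        "server": "http://myproxy.example.com:3128",  # placeholder endpoint
        "username": "user",
        "password": "pass",
    },
}
```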

— elacuesta, Sep 07 '22

That BrightDataProxyMiddleware only gets activated if I set BRIGHTDATA_ENABLED: True in the custom settings; otherwise it doesn't do anything. Just to be sure, I also changed the settings to the ones below, and I still get the same error:

    CUSTOM_DOWNLOADER_MIDDLEWARES = {}
    CUSTOM_DOWNLOADER_MIDDLEWARES.update({
        'project.middlewares.BrightDataProxyMiddleware': None,
        'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 400,
        'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': None,
        'tor_ip_rotator.middlewares.TorProxyMiddleware': None,
    })

    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        'DOWNLOADER_MIDDLEWARES': CUSTOM_DOWNLOADER_MIDDLEWARES,
        'USER_AGENT': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:96.0) Gecko/20100101 Firefox/96.0',

    }

— dragospopa420, Sep 08 '22

Getting the same error, any fixes?

— mianemad, Sep 25 '22

I still could not reproduce.

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "DOWNLOADER_MIDDLEWARES": {
            "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": 400,
        },
        "USER_AGENT": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:96.0) Gecko/20100101 Firefox/96.0",
    }

    def start_requests(self):
        url = "https://quotes.toscrape.com/js/"
        yield scrapy.Request(url, meta={"playwright": True})

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }

(Note that I'm not including any of the non-built-in middlewares; disabling them with None is the same as not adding them in the first place.)
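To illustrate that point, here is a quick sketch showing that entries set to None are filtered out, so the two configurations end up equivalent:

```python
# A DOWNLOADER_MIDDLEWARES dict where some components are disabled with None.
middlewares_with_none = {
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": 400,
    "scrapy_user_agents.middlewares.RandomUserAgentMiddleware": None,
    "tor_ip_rotator.middlewares.TorProxyMiddleware": None,
}

# The same configuration with the disabled components simply omitted.
middlewares_without = {
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": 400,
}

# Scrapy drops None-valued entries when building the middleware chain,
# so after filtering the two dicts describe the same set of middlewares.
enabled = {k: v for k, v in middlewares_with_none.items() if v is not None}
assert enabled == middlewares_without
```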

$ scrapy crawl quotes -o quotes.json
(...)
2022-09-27 14:47:38 [scrapy.core.engine] INFO: Spider opened
2022-09-27 14:47:38 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-09-27 14:47:38 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-09-27 14:47:38 [scrapy-playwright] INFO: Starting download handler
2022-09-27 14:47:44 [scrapy-playwright] INFO: Launching browser chromium
2022-09-27 14:47:44 [scrapy-playwright] INFO: Browser chromium launched
2022-09-27 14:47:46 [scrapy.core.engine] INFO: Closing spider (finished)
2022-09-27 14:47:46 [scrapy.extensions.feedexport] INFO: Stored json feed (10 items) in: quotes.json
(...)
2022-09-27 14:47:46 [scrapy.core.engine] INFO: Spider closed (finished)
2022-09-27 14:47:46 [scrapy-playwright] INFO: Closing download handler
2022-09-27 14:47:47 [scrapy-playwright] INFO: Closing browser
$ scrapy version -v
Scrapy       : 2.6.0
lxml         : 4.8.0.0
libxml2      : 2.9.12
cssselect    : 1.1.0
parsel       : 1.6.0
w3lib        : 1.22.0
Twisted      : 22.2.0
Python       : 3.9.6 (default, Sep  6 2021, 10:09:19) - [GCC 7.5.0]
pyOpenSSL    : 22.0.0 (OpenSSL 1.1.1m  14 Dec 2021)
cryptography : 36.0.1
Platform     : Linux-5.4.0-125-generic-x86_64-with-glibc2.31

$ python -c "import scrapy_playwright; print(scrapy_playwright.__version__)"
0.0.21

— elacuesta, Sep 27 '22