scrapy-playwright
Issue running scrape on Mac
I am getting the error shown below when running a scrape, and I can't tell which argument is being reported as invalid.
Model Name: MacBook Pro
Model Identifier: Mac14,7
Model Number: MNEJ3LL/A
Chip: Apple M2
Total Number of Cores: 8 (4 performance and 4 efficiency)
Memory: 8 GB
System Firmware Version: 8419.80.7
OS Loader Version: 8419.80.7
System Version: macOS 13.2.1 (22D68)
Kernel Version: Darwin 22.3.0
Boot Volume: Macintosh HD
Boot Mode: Normal
See code:
def parse(self, response):
    # Check if the page is a log-in or authentication page
    if self.is_login_page(response):
        self.logger.info(f"Ignoring log-in page: {response.url}")
        return
    print("Extracting")
    # Extract data from the current page
    extracted_data = self.extract_data(response)
    print("Update Meta")
    # Insert the entire response into the database
    self.update_meta(response.url, extracted_data)
    print("Adding Url", response.url)
    self.visited_urls.add(response.url)
    yield extracted_data
    print("View Response", response)
    # Extracting links to other pages
    for link in response.css("a::attr(href)").getall():
        absolute_url = urljoin(response.url, link)
        if absolute_url.startswith("javascript:"):
            continue  # Ignore JavaScript links
        if absolute_url not in self.visited_urls:
            print("Run Req", absolute_url)
            # Avoid re-scraping now that a request is being issued for this link
            self.visited_urls.add(absolute_url)
            yield scrapy.Request(
                url=absolute_url, callback=self.parse, errback=self.error_handler, meta={"playwright": True}
            )
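To rule out the link-following loop as the source of a bad URL, the filtering step can be pulled out into a small helper that runs without Scrapy. This is only a sketch: `is_crawlable` and `next_urls` are hypothetical names that are not part of the spider above, and it assumes only `http`/`https` links should be queued and that URL fragments can be dropped.

```python
from urllib.parse import urldefrag, urljoin, urlsplit


def is_crawlable(url: str) -> bool:
    """Return True only for absolute http(s) URLs with a host."""
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def next_urls(base_url: str, hrefs: list[str], visited: set[str]) -> list[str]:
    """Resolve hrefs against base_url and return the ones not yet visited.

    `visited` is updated in place, mirroring `self.visited_urls` in the spider.
    """
    queued = []
    for href in hrefs:
        # Resolve relative links and drop any "#fragment" part
        absolute_url, _fragment = urldefrag(urljoin(base_url, href.strip()))
        if not is_crawlable(absolute_url):
            continue  # Skips javascript:, mailto:, tel:, and similar schemes
        if absolute_url in visited:
            continue
        visited.add(absolute_url)
        queued.append(absolute_url)
    return queued
```

Inside `parse`, the loop would then iterate over `next_urls(response.url, response.css("a::attr(href)").getall(), self.visited_urls)` and yield one request per returned URL.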
See the error below. Beyond it indicating a bad argument, I am unsure what to make of it or where that argument comes from.
New Url {'domain': 'www.acnestudios.com', 'raw': 'https://www.acnestudios.com/ca/en/twill-trousers-dark-grey/BK0589-AA3', 'lastScrape': datetime.datetime(2024, 3, 16, 15, 25, 27, 347681)}
2024-03-16 15:25:30 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.acnestudios.com/robots.txt> (referer: None)
2024-03-16 15:25:30 [scrapy-playwright] INFO: Launching browser chromium
2024-03-16 15:25:31 [scrapy-playwright] INFO: Browser chromium launched
2024-03-16 15:25:31 [scrapy-playwright] DEBUG: Browser context started: 'default' (persistent=False, remote=False)
2024-03-16 15:25:31 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)
2024-03-16 15:25:31 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://www.acnestudios.com/ca/en/twill-trousers-dark-grey/BK0589-AA3> (resource type: document)
2024-03-16 15:25:31 [scrapy-playwright] WARNING: Closing page due to failed request: <GET https://www.acnestudios.com/ca/en/twill-trousers-dark-grey/BK0589-AA3> exc_type=<class 'playwright._impl._errors.Error'> exc_msg=net::ERR_INVALID_ARGUMENT at https://www.acnestudios.com/ca/en/twill-trousers-dark-grey/BK0589-AA3
Traceback (most recent call last):
File "/Users/user/Projects/tycoon-web-scraper/tycoon-web-crawler-python-base/scraper/lib/python3.11/site-packages/scrapy_playwright/handler.py", line 340, in _download_request
return await self._download_request_with_page(request, page, spider)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/user/Projects/tycoon-web-scraper/tycoon-web-crawler-python-base/scraper/lib/python3.11/site-packages/scrapy_playwright/handler.py", line 369, in _download_request_with_page
response, download = await self._get_response_and_download(request=request, page=page)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/user/Projects/tycoon-web-scraper/tycoon-web-crawler-python-base/scraper/lib/python3.11/site-packages/scrapy_playwright/handler.py", line 461, in _get_response_and_download
response = await page.goto(url=request.url, **page_goto_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/user/Projects/tycoon-web-scraper/tycoon-web-crawler-python-base/scraper/lib/python3.11/site-packages/playwright/async_api/_generated.py", line 8612, in goto
await self._impl_obj.goto(
File "/Users/user/Projects/tycoon-web-scraper/tycoon-web-crawler-python-base/scraper/lib/python3.11/site-packages/playwright/_impl/_page.py", line 500, in goto
return await self._main_frame.goto(**locals_to_params(locals()))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/user/Projects/tycoon-web-scraper/tycoon-web-crawler-python-base/scraper/lib/python3.11/site-packages/playwright/_impl/_frame.py", line 145, in goto
await self._channel.send("goto", locals_to_params(locals()))
File "/Users/user/Projects/tycoon-web-scraper/tycoon-web-crawler-python-base/scraper/lib/python3.11/site-packages/playwright/_impl/_connection.py", line 59, in send
return await self._connection.wrap_api_call(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/user/Projects/tycoon-web-scraper/tycoon-web-crawler-python-base/scraper/lib/python3.11/site-packages/playwright/_impl/_connection.py", line 509, in wrap_api_call
return await cb()
^^^^^^^^^^
File "/Users/user/Projects/tycoon-web-scraper/tycoon-web-crawler-python-base/scraper/lib/python3.11/site-packages/playwright/_impl/_connection.py", line 97, in inner_send
result = next(iter(done)).result()
^^^^^^^^^^^^^^^^^^^^^^^^^
playwright._impl._errors.Error: net::ERR_INVALID_ARGUMENT at https://www.acnestudios.com/ca/en/twill-trousers-dark-grey/BK0589-AA3
2024-03-16 15:25:31 [TycoonSpider] ERROR: Error: net::ERR_INVALID_ARGUMENT at https://www.acnestudios.com/ca/en/twill-trousers-dark-grey/BK0589-AA3
2024-03-16 15:25:31 [scrapy.core.engine] INFO: Closing spider (finished)