scrapy-playwright
🎭 Playwright integration for Scrapy
Closes #307

Tasks:
- [x] implementation (I'm working on a callback decorator instead)
- [ ] tests
- [ ] docs
Hi, I'm using scrapy-playwright for data scraping, where URLs are provided through a txt file. I've noticed that every time a URL is scraped, the browser restarts, which significantly reduces...
Is there a way to take a screenshot for a `process_spider_exception` error? I can't figure out how to access the page object in that middleware.
```python
# example.py
import scrapy
from playwright.async_api import Page


class ExampleSpider(scrapy.Spider):
    name = "example"
    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "_PLAYWRIGHT_THREADED_LOOP": True,  # private setting, used...
```
Cannot close spider through SIGINT (ctrl+c)

My code:

```python
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy_playwright.page import PageMethod

meta = {
    'playwright': True,
    'playwright_include_page': True,
    'playwright_page_methods': [PageMethod('wait_for_load_state', 'networkidle')]
}

async def error_back(failure):...
```
I just created an example spider. Chromium works well, but with the setup below it raises `NS_ERROR_PROXY_CONNECTION_REFUSED`, from `playwright._impl._errors.Error: Page.goto: NS_ERROR_PROXY_CONNECTION_REFUSED`. Debugging into `ScrapyPlaywrightDownloadHandler._maybe_launch_browser`, I got these launch_options. ```python...
This pull request includes a small but important change to the `scrapy_playwright/handler.py` file. The change involves modifying the handling of download headers to remove the `Content-Encoding` header before creating the...
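The idea behind that change can be sketched in isolation. The assumption here is that the body handed back by the browser is already decoded, so a leftover `Content-Encoding` header would make later processing try to decompress it a second time. The helper name `strip_content_encoding` is hypothetical and is not the code in `scrapy_playwright/handler.py`, which this excerpt does not show.

```python
def strip_content_encoding(headers: dict) -> dict:
    """Return a copy of headers without Content-Encoding.

    Header names are compared case-insensitively, and the input
    mapping is left unmodified.
    """
    return {
        name: value
        for name, value in headers.items()
        if name.lower() != "content-encoding"
    }


# Example headers for a downloaded file (illustrative values only).
raw_headers = {
    "Content-Type": "application/pdf",
    "content-encoding": "gzip",
}
clean_headers = strip_content_encoding(raw_headers)
```

After this call `clean_headers` keeps `Content-Type` only, while `raw_headers` still contains both entries.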
An exception is thrown because of the wrong content encoding when I fire the download event and get the file response. ```python 2024-10-08 15:26:39 [scrapy.core.scraper] ERROR: Error downloading Traceback (most...
I am using scrapy-playwright (latest versions) with the WebKit browser on Ubuntu 22.04. I can start and debug the spider once or twice. Trying to stop it using the...
Hi @elacuesta, still loving this library! :) I often find myself having to deal with the Playwright page in my request callback because I need to perform some page actions...