
Feature: Scrapy plugin for Pydoll (`scrapy-pydoll`)

Open thalissonvs opened this issue 4 months ago • 1 comment

Make it trivial to use Pydoll inside Scrapy without custom glue code. The plugin should let a spider opt in per request to drive a headless tab, run small actions (clicks, waits), and receive a rendered HtmlResponse that works with Scrapy selectors. It should feel like standard Scrapy, just powered by Pydoll when needed.

Proposed API

  • Installable optional plugin: pip install scrapy-pydoll
  • Enable via settings:
PYDOLL_ENABLED = True
PYDOLL_CONCURRENCY = 2
PYDOLL_BROWSER_OPTIONS = { "geolocation": "GB", "headless": True }
  • Per-request opt-in (meta) or helper Request:
yield scrapy.Request(
    url,
    meta={
        "pydoll": {
            "actions": [
                {"type": "wait", "for": "networkidle"},
                {"type": "click", "selector": "#show-more"},
            ],
            "timeout": 15000,
        },
        "cookiejar": "sessionA",
    },
    callback=self.parse_page,
)

# or
yield PydollRequest(url, actions=[...], timeout=15000)
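To make the meta contract above explicit, here is a minimal sketch of a helper that packs the per-request options into the proposed `meta` schema. The function name `make_pydoll_meta` and the default timeout are hypothetical, not part of any existing Pydoll or Scrapy API; a real `PydollRequest` would likely be a thin `scrapy.Request` subclass doing the same packing in `__init__`:

```python
def make_pydoll_meta(actions=None, timeout=15000, extra_meta=None):
    """Build the request meta dict the (hypothetical) plugin middleware
    would consume, merged with any other meta keys such as cookiejar."""
    meta = dict(extra_meta or {})
    meta["pydoll"] = {
        "actions": list(actions or []),
        "timeout": timeout,
    }
    return meta
```

With such a helper, the first example request could be written as `scrapy.Request(url, meta=make_pydoll_meta([{"type": "wait", "for": "networkidle"}], extra_meta={"cookiejar": "sessionA"}), callback=self.parse_page)`.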

Requirements (MVP)

  • Deterministic rendered HtmlResponse compatible with .css() / .xpath()
  • Wait strategies: networkidle, selector, sleep(ms)
  • Small action set: click, type, scroll
  • Per-request headers/cookies merged with Pydoll context
  • Session reuse by cookiejar; graceful shutdown on spider_closed
  • Timeouts, retries surfaced as IgnoreRequest or similar
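The small action set could be driven by a simple dispatcher inside the downloader middleware. A sketch, assuming a hypothetical async `tab` object exposing `wait_for(condition, timeout_ms)`, `click(selector)`, `type(selector, text)`, and `scroll(y)` (these method names are placeholders, not Pydoll's real Tab API):

```python
async def run_actions(tab, actions, default_timeout_ms=15000):
    """Execute the per-request action list against a (hypothetical) tab.

    Unknown action types fail fast so a typo in spider meta surfaces
    immediately rather than silently skipping a step.
    """
    for action in actions:
        kind = action.get("type")
        if kind == "wait":
            await tab.wait_for(
                action["for"],
                timeout_ms=action.get("timeout", default_timeout_ms),
            )
        elif kind == "click":
            await tab.click(action["selector"])
        elif kind == "type":
            await tab.type(action["selector"], action["text"])
        elif kind == "scroll":
            await tab.scroll(action.get("y", 0))
        else:
            raise ValueError(f"unknown pydoll action: {kind!r}")
```

After the loop completes, the middleware would read the tab's final HTML and wrap it in an `HtmlResponse` for the spider callback; timeouts raised here would map to `IgnoreRequest` per the requirement above.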

Follow-ups

  • Optionally attach Markdown (return_markdown=True) once an exporter exists
  • Network record on error (integration with recorder feature)
  • Page bundle snapshot on exception for offline debugging
  • WebPoet/Scrapy-Poet provider to inject a Tab or rendered HTML

Example Spider

class ExampleSpider(scrapy.Spider):
    name = "example"

    def start_requests(self):
        yield scrapy.Request(
            "https://example.com/products",
            meta={"pydoll": {
                "actions": [{"type": "wait", "for": "networkidle"}],
                "timeout": 15000
            }},
            callback=self.parse_list
        )

    def parse_list(self, response):
        for href in response.css(".item a::attr(href)").getall():
            yield scrapy.Request(
                response.urljoin(href),
                meta={"pydoll": {"actions": [{"type": "click", "selector": "#accept"}]}},
                callback=self.parse_item
            )

    def parse_item(self, response):
        yield {
            "title": response.css("h1::text").get(),
            "price": response.css(".price::text").get(),
        }

thalissonvs avatar Aug 22 '25 05:08 thalissonvs

I'll take it

LucasAlvws avatar Nov 29 '25 20:11 LucasAlvws