Feature: Scrapy plugin for Pydoll (`scrapy-pydoll`)
Make it trivial to use Pydoll inside Scrapy without custom glue code. The plugin should let a spider opt in per request to drive a headless tab, run small actions (clicks, waits), and return a rendered `HtmlResponse` that plays nicely with Scrapy selectors. It should feel like standard Scrapy, just powered by Pydoll when needed.
Proposed API
- Installable as an optional plugin: `pip install scrapy-pydoll`
- Enable via settings:

```python
PYDOLL_ENABLED = True
PYDOLL_CONCURRENCY = 2
PYDOLL_BROWSER_OPTIONS = {"geolocation": "GB", "headless": True}
```
- Per-request opt-in (via `meta`) or a helper Request class:

```python
yield scrapy.Request(
    url,
    meta={
        "pydoll": {
            "actions": [
                {"type": "wait", "for": "networkidle"},
                {"type": "click", "selector": "#show-more"},
            ],
            "timeout": 15000,
        },
        "cookiejar": "sessionA",
    },
    callback=self.parse_page,
)

# or
yield PydollRequest(url, actions=[...], timeout=15000)
```
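A minimal sketch of what the `PydollRequest` helper could look like, assuming it does nothing more than pack `actions` and `timeout` into `meta["pydoll"]` so the rest of the pipeline only has to understand the `meta` form. The class does not exist yet; everything beyond `scrapy.Request` is an assumption:

```python
import scrapy


class PydollRequest(scrapy.Request):
    """Convenience Request that opts in to Pydoll rendering via meta (sketch)."""

    def __init__(self, url, actions=None, timeout=30000, **kwargs):
        meta = dict(kwargs.pop("meta", None) or {})
        # Pack the Pydoll options into meta so a downloader middleware can find them.
        meta["pydoll"] = {"actions": list(actions or []), "timeout": timeout}
        super().__init__(url, meta=meta, **kwargs)
```

With this, `yield PydollRequest(url, actions=[...], timeout=15000)` and the plain `meta` form above would be interchangeable.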
Requirements (MVP)
- Deterministic rendered `HtmlResponse` compatible with `.css()`/`.xpath()`
- Wait strategies: `networkidle`, `selector`, `sleep(ms)`
- Small action set: `click`, `type`, `scroll`
- Per-request headers/cookies merged with the Pydoll context
- Session reuse keyed by `cookiejar`; graceful shutdown on `spider_closed`
- Timeouts and retries surfaced as `IgnoreRequest` or similar (see the middleware sketch below)
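One way these requirements could fit together is a downloader middleware that intercepts requests carrying `meta["pydoll"]`, drives a Pydoll tab, and hands back a rendered `HtmlResponse`. The sketch below sticks to Scrapy APIs (`HtmlResponse`, `IgnoreRequest`, a coroutine `process_request`); the Pydoll-specific tab handling is left as a placeholder because the class name and rendering helper are assumptions, not an existing API:

```python
import asyncio

from scrapy.exceptions import IgnoreRequest
from scrapy.http import HtmlResponse


class PydollDownloaderMiddleware:
    """Sketch of the proposed middleware; names and structure are assumptions."""

    async def process_request(self, request, spider):
        options = request.meta.get("pydoll")
        if not options:
            return None  # let Scrapy's normal download handler fetch it

        timeout_ms = options.get("timeout", 30000)
        try:
            html = await asyncio.wait_for(
                self._render(request, options), timeout=timeout_ms / 1000
            )
        except asyncio.TimeoutError:
            # Surface timeouts as IgnoreRequest, per the MVP requirements.
            raise IgnoreRequest(f"Pydoll timed out after {timeout_ms} ms: {request.url}")

        return HtmlResponse(
            url=request.url, body=html, encoding="utf-8", request=request
        )

    async def _render(self, request, options):
        # Placeholder: open or reuse a Pydoll tab (keyed by request.meta.get("cookiejar")),
        # navigate to request.url, run options["actions"] (wait/click/type/scroll),
        # and return the rendered page HTML.
        raise NotImplementedError
```

Session pooling (`PYDOLL_CONCURRENCY`, reuse by `cookiejar`) and the `spider_closed` shutdown hook would live behind `_render`.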
Follow-ups
- Optionally attach Markdown (`return_markdown=True`) once the exporter exists
- Network recording on error (integration with the recorder feature)
- Page bundle snapshot on exception for offline debugging
- WebPoet/Scrapy-Poet provider to inject a `Tab` or rendered HTML
Example Spider
```python
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"

    def start_requests(self):
        yield scrapy.Request(
            "https://example.com/products",
            meta={"pydoll": {
                "actions": [{"type": "wait", "for": "networkidle"}],
                "timeout": 15000,
            }},
            callback=self.parse_list,
        )

    def parse_list(self, response):
        for href in response.css(".item a::attr(href)").getall():
            yield scrapy.Request(
                response.urljoin(href),
                meta={"pydoll": {"actions": [{"type": "click", "selector": "#accept"}]}},
                callback=self.parse_item,
            )

    def parse_item(self, response):
        yield {
            "title": response.css("h1::text").get(),
            "price": response.css(".price::text").get(),
        }
```
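Assuming the plugin ships the middleware sketched above (the module path and priority value are illustrative, not an existing package layout), wiring it into the example project's `settings.py` might look like:

```python
# settings.py (hypothetical module path and priority value)
PYDOLL_ENABLED = True
PYDOLL_CONCURRENCY = 2

DOWNLOADER_MIDDLEWARES = {
    "scrapy_pydoll.middleware.PydollDownloaderMiddleware": 585,
}

# Pydoll is asyncio-based, so the asyncio reactor is required.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```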
I'll take it.