[Feature]: CLI server mode (pydoll serve)
It would be useful to expose Pydoll as an HTTP service, so external systems can trigger crawls without writing Python code. The idea is to add a CLI command:
```
pydoll serve --port 8000
```
This spins up a lightweight server (likely as a plugin, to avoid bloating the core) and exposes a simple API.
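For illustration only, here is a minimal sketch of what the `serve` subcommand could look like, assuming the HTTP layer is an ASGI app (like the `/crawl` sketch further down) served with uvicorn; the `pydoll_serve.api:app` import path and the subcommand wiring are hypothetical, not part of pydoll today.

```python
# Hypothetical CLI entry point for `pydoll serve`, using argparse + uvicorn.
import argparse

import uvicorn


def main() -> None:
    parser = argparse.ArgumentParser(prog="pydoll")
    subparsers = parser.add_subparsers(dest="command", required=True)

    serve = subparsers.add_parser("serve", help="run the Pydoll HTTP API")
    serve.add_argument("--host", default="127.0.0.1")
    serve.add_argument("--port", type=int, default=8000)

    args = parser.parse_args()
    if args.command == "serve":
        # "pydoll_serve.api:app" is a placeholder import path for the ASGI app.
        uvicorn.run("pydoll_serve.api:app", host=args.host, port=args.port)


if __name__ == "__main__":
    main()
```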
Proposed API
Initial endpoint:
- `POST /crawl` → body contains `{ "url": "https://example.com", "format": "html" | "markdown" }`
- Response returns the page content, either as HTML or Markdown (depending on the Markdown exporter feature).
This endpoint becomes a foundation for LLM integrations, where the returned HTML or Markdown can be fed into models for structured data extraction. By exposing crawling as a simple web API, Pydoll can be plugged directly into AI pipelines, data labeling flows, or automated extraction systems without extra glue code.
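To make the request/response shape concrete, here is a minimal sketch of the endpoint, assuming FastAPI for the HTTP layer; the pydoll calls (`Chrome`, `start`, `go_to`, `page_source`) follow the library's documented async usage but may need adjusting to the installed version, and `markdownify` stands in for the proposed Markdown exporter.

```python
# Minimal sketch of the proposed POST /crawl endpoint (assumptions noted inline).
from typing import Literal

from fastapi import FastAPI
from markdownify import markdownify
from pydantic import BaseModel

from pydoll.browser import Chrome  # import path assumed

app = FastAPI()


class CrawlRequest(BaseModel):
    url: str
    format: Literal["html", "markdown"] = "html"


@app.post("/crawl")
async def crawl(req: CrawlRequest) -> dict:
    # One-shot crawl: open a browser, load the page, return its content.
    async with Chrome() as browser:
        tab = await browser.start()
        await tab.go_to(req.url)
        html = await tab.page_source  # assumed accessor for the rendered HTML

    content = markdownify(html) if req.format == "markdown" else html
    return {"url": req.url, "format": req.format, "content": content}
```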
This could start as a separate repository (`pydoll-serve`) and evolve independently, but integrating a CLI hook into Pydoll keeps the DX simple.
As far as I can tell, this feature is a bit more elaborate than that.
The power of pydoll is not just in scraping a single web page, but in managing a full browsing context. If you just pull direct web URLs (by parsing them out of result HTML pages), you're still not behaving like a human.
I suspect that to make this worthwhile, one will need to interact with a page using the framework, i.e. click on links instead of parsing URLs from HTML. This means the API will need to maintain a session, etc...
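Purely as a sketch of that direction, a session-oriented API might keep one browser/tab per session id and expose interaction endpoints instead of one-shot fetches; the routes, the in-memory session store, and the element lookup/click/teardown calls below are all assumptions, not an agreed design.

```python
# Sketch of a stateful, session-based API; pydoll calls are assumed, see comments.
import uuid

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

from pydoll.browser import Chrome  # import path assumed

app = FastAPI()
sessions: dict[str, tuple] = {}  # session id -> (browser, tab)


class Navigate(BaseModel):
    url: str


class Click(BaseModel):
    selector: str


@app.post("/sessions")
async def create_session() -> dict:
    # Keep the browser alive between requests so state (cookies, DOM) persists.
    browser = Chrome()
    tab = await browser.start()
    session_id = str(uuid.uuid4())
    sessions[session_id] = (browser, tab)
    return {"session_id": session_id}


@app.post("/sessions/{session_id}/navigate")
async def navigate(session_id: str, body: Navigate) -> dict:
    if session_id not in sessions:
        raise HTTPException(status_code=404, detail="unknown session")
    _, tab = sessions[session_id]
    await tab.go_to(body.url)
    return {"ok": True}


@app.post("/sessions/{session_id}/click")
async def click(session_id: str, body: Click) -> dict:
    if session_id not in sessions:
        raise HTTPException(status_code=404, detail="unknown session")
    _, tab = sessions[session_id]
    element = await tab.query(body.selector)  # assumed lookup by CSS selector
    await element.click()
    return {"ok": True}


@app.delete("/sessions/{session_id}")
async def close_session(session_id: str) -> dict:
    browser, _ = sessions.pop(session_id, (None, None))
    if browser is None:
        raise HTTPException(status_code=404, detail="unknown session")
    await browser.stop()  # assumed teardown method
    return {"ok": True}
```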
Yeah, this is just an initial idea, y'know. I need to think it through a bit more, hehe. But if you have suggestions, feel free to comment here; it would be really useful.
> Yeah, this is just an initial idea, y'know. I need to think it through a bit more, hehe.
Of course.
> But if you have suggestions, feel free to comment here; it would be really useful.
I forked and made a WIP branch; it can be seen at nirizr/pydoll/.
This is an untested and very incomplete initial attempt at tackling web-service API functionality. I made it to start getting comments before I put too much into it, so feel free to speak your mind :)
If preferred, I can split most of the functionality out in one of the following ways:
- Keep it as part of this project, behind an optional installation flag
- Move it to a different repository as a plugin
- Create a separate package that depends on pydoll and simply imports it (`pydoll-api` or something)
It's currently not tested at all, so if I were you I wouldn't bother trying it yet. I will hopefully test it in the coming days, since I'm planning to actually use it in the near future.
I've tested it a bit and it's currently running (and includes a docker compose file for easy setup).
The API structure is simple and doesn't support more complex logic, but I think it's a good start.
I've neglected this a bit because I couldn't get pydoll to go undetected on the website I'm interested in scraping...