[Feature]: CLI server mode (pydoll serve)

Open thalissonvs opened this issue 4 months ago • 4 comments

It would be useful to expose Pydoll as an HTTP service, so external systems can trigger crawls without writing Python code. The idea is to add a CLI command:

pydoll serve --port 8000

This spins up a lightweight server (likely as a plugin, to avoid bloating the core) and exposes a simple API.
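As a rough illustration of the CLI hook, here is a minimal sketch using only the standard library's `argparse`. The `build_parser` helper and the `serve` subcommand wiring are assumptions made for this example; nothing here reflects how Pydoll is actually structured, and the default port is simply the one from the command above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a hypothetical `pydoll` CLI exposing a single `serve` subcommand."""
    parser = argparse.ArgumentParser(prog="pydoll")
    subcommands = parser.add_subparsers(dest="command", required=True)

    # `serve` and `--port` mirror the proposed command; the rest is illustrative.
    serve = subcommands.add_parser("serve", help="start the HTTP crawl service")
    serve.add_argument("--port", type=int, default=8000, help="TCP port to listen on")
    return parser


# Parse the exact invocation proposed in the issue.
args = build_parser().parse_args(["serve", "--port", "8000"])
print(args.command, args.port)
```

If the server lives in a plugin, the core package would only need this thin parser and could import the plugin lazily when `serve` is invoked.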

Proposed API

Initial endpoint:

  • POST /crawl → body contains { "url": "https://example.com", "format": "html" | "markdown" }
  • Response returns the page content, either as HTML or Markdown (depending on the Markdown exporter feature).
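To make the request contract above concrete, here is a minimal, framework-agnostic sketch of the validation step for `POST /crawl`. Everything named here (`handle_crawl`, `Fetcher`, `fake_fetch`, the error messages, the response shape, and the default of `"html"` when `format` is omitted) is an assumption for illustration; the actual page fetch is left as a pluggable callable because the issue does not specify how Pydoll would be driven.

```python
import json
from typing import Callable, Tuple

ALLOWED_FORMATS = {"html", "markdown"}

# Hypothetical signature: takes (url, format) and returns the page content as text.
Fetcher = Callable[[str, str], str]


def handle_crawl(raw_body: bytes, fetch: Fetcher) -> Tuple[int, dict]:
    """Validate a POST /crawl body and return (status_code, response_json)."""
    try:
        payload = json.loads(raw_body)
    except ValueError:  # covers both invalid JSON and undecodable bytes
        return 400, {"error": "body must be valid JSON"}
    if not isinstance(payload, dict):
        return 400, {"error": "body must be a JSON object"}

    url = payload.get("url")
    fmt = payload.get("format", "html")  # defaulting to "html" is an assumption
    if not isinstance(url, str) or not url.startswith(("http://", "https://")):
        return 400, {"error": "'url' must be an http(s) URL"}
    if not isinstance(fmt, str) or fmt not in ALLOWED_FORMATS:
        return 400, {"error": "'format' must be 'html' or 'markdown'"}

    return 200, {"url": url, "format": fmt, "content": fetch(url, fmt)}


def fake_fetch(url: str, fmt: str) -> str:
    """Stand-in for the real browser-driven fetch, which is not shown here."""
    return "<html></html>" if fmt == "html" else "# Example"


status, body = handle_crawl(
    b'{"url": "https://example.com", "format": "markdown"}', fake_fetch
)
print(status, body["content"])
```

Keeping validation separate from the fetch means the same function could sit behind any HTTP framework, and the fetcher can be swapped for a stub in tests.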

This endpoint becomes a foundation for LLM integrations, where the returned HTML or Markdown can be fed into models for structured data extraction. By exposing crawling as a simple web API, Pydoll can be plugged directly into AI pipelines, data labeling flows, or automated extraction systems without extra glue code.

This could start as a separate repository (pydoll-serve) and evolve independently, but integrating a CLI hook into Pydoll keeps the DX simple.

thalissonvs avatar Aug 22 '25 05:08 thalissonvs

As far as I can tell, this feature is a bit more elaborate than that.

The power of pydoll is not just in scraping a single web page, but in managing a full context. If you just pull web URLs directly (by parsing the resulting HTML pages), you're still not behaving like a human.

I suspect that to make this worthwhile, one will need to interact with a page using the framework, i.e. click on links instead of parsing URLs from HTML. This means an API will need to maintain a session, etc...
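One way to picture the session requirement is a small in-memory registry that maps a session id to whatever long-lived browser context the server keeps open between requests. This is only a sketch under assumptions: `SessionStore`, the idle timeout, and the use of a plain dict as the "context" are all hypothetical, and a real server would also need to close the underlying browser tab when a session expires.

```python
import time
import uuid
from typing import Any, Dict, Optional, Tuple


class SessionStore:
    """In-memory map from session id to a long-lived browser context."""

    def __init__(self, ttl_seconds: float = 300.0) -> None:
        self._ttl = ttl_seconds
        # session id -> (context, time of last use)
        self._sessions: Dict[str, Tuple[Any, float]] = {}

    def create(self, context: Any) -> str:
        """Register a context and return an opaque id the client sends back."""
        session_id = uuid.uuid4().hex
        self._sessions[session_id] = (context, time.monotonic())
        return session_id

    def get(self, session_id: str) -> Optional[Any]:
        """Return the context for an id, or None if unknown or idle too long."""
        entry = self._sessions.get(session_id)
        if entry is None:
            return None
        context, last_used = entry
        if time.monotonic() - last_used > self._ttl:
            del self._sessions[session_id]  # expired; drop it from the registry
            return None
        # Refresh the idle timer on every successful lookup.
        self._sessions[session_id] = (context, time.monotonic())
        return context


store = SessionStore(ttl_seconds=60)
sid = store.create(context={"current_url": "https://example.com"})
print(store.get(sid))
```

With something like this, follow-up requests (click a link, read the new page) could carry the session id instead of a fresh URL, so the browser state persists across calls.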

nirizr avatar Sep 17 '25 08:09 nirizr

Yeah, this is just an initial idea, y'know. I need to think more about it, hehe. But if you have suggestions, feel free to comment here; it would be really useful.

thalissonvs avatar Sep 21 '25 02:09 thalissonvs

Yeah, this is just an initial idea, y'know. I need to think better about it hehe

Of course.

But if you have suggestions, feel free to comment here, it would be really useful

I forked the repo and made a WIP branch, which can be seen at nirizr/pydoll/ .

This is an untested and very incomplete initial attempt to tackle web service API functionality. I made this to start getting comments before I put too much into it, so feel free to speak your mind :)

If preferred, I can organize most of the functionality in one of the following ways:

  1. Keep it as part of this project, behind an optional installation flag
  2. Move it to a different repository as a plugin
  3. Create a package that depends on pydoll and simply imports it (pydoll-api or something)

It's currently not tested at all, so if I were you I wouldn't bother trying it yet. I hope to test it in the coming days, as I'm planning to actually use it in the near future.

nirizr avatar Sep 28 '25 07:09 nirizr

I've tested it a bit and it's currently running (and includes a Docker Compose file for easy setup).

The API structure is simple and doesn't support more complex logic, but it's a good start, I think.

I neglected this a bit because I couldn't get pydoll undetected on the website that I'm interested in scraping...

nirizr avatar Nov 11 '25 07:11 nirizr