[Feature]: Export HTML to Markdown
The main motivation is to make the output more suitable for LLM ingestion, dataset creation, and reproducible text comparisons. Markdown provides a cleaner, more standardized structure compared to raw HTML, which is usually full of layout noise, scripts, and temporary attributes.
Basic behavior
- <h1> → #, <h2> → ##, <h3> → ###
- <p> → simple text line
- <ul>/<ol> → Markdown lists
- <a> → [text](url)
- <img> → ![alt](src) (optional, configurable)
- <pre><code> → fenced code blocks
- Tables → Markdown tables, with CSV as a fallback
- Inline spans or styling without semantic meaning are discarded
- Scripts, styles, and invisible nodes are ignored
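To make the mappings above concrete, here is a minimal sketch built only on the standard library's `html.parser`. It is not the planned implementation: the names `MarkdownSketch`, `html_to_markdown`, and `include_images` are hypothetical. It covers headings, paragraphs, flat lists, links, images, and `<pre><code>` blocks. Tables, nested lists, and visibility checks beyond `<script>`/`<style>` are left out. Inline tags such as `<span>` are dropped while their text is kept.

```python
from html.parser import HTMLParser

HEADINGS = {"h1": "#", "h2": "##", "h3": "###"}
SKIPPED = {"script", "style"}   # content of these tags is dropped entirely
FENCE = "`" * 3                 # Markdown code fence


class MarkdownSketch(HTMLParser):
    """Hypothetical converter; collects Markdown blocks while parsing."""

    def __init__(self, include_images=True):
        super().__init__(convert_charrefs=True)
        self.include_images = include_images
        self.blocks = []       # finished Markdown blocks
        self.buf = []          # text fragments of the block being built
        self.skip_depth = 0    # > 0 while inside <script> or <style>
        self.list_stack = []   # one [ordered, items] entry per open list
        self.link = None       # (href, buffer index where link text starts)
        self.in_pre = False

    def _take_text(self):
        raw = "".join(self.buf)
        self.buf = []
        # Keep whitespace inside <pre>; collapse it everywhere else.
        return raw.strip("\n") if self.in_pre else " ".join(raw.split())

    def _flush(self, prefix=""):
        text = self._take_text()
        if text:
            self.blocks.append(prefix + text)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in SKIPPED:
            self.skip_depth += 1
        elif tag == "p" or tag in HEADINGS:
            self._flush()
        elif tag in ("ul", "ol"):
            self._flush()
            self.list_stack.append([tag == "ol", []])
        elif tag == "li":
            self.buf = []
        elif tag == "pre":
            self._flush()
            self.in_pre = True
        elif tag == "a":
            self.link = (attrs.get("href") or "", len(self.buf))
        elif tag == "img" and self.include_images:
            self.buf.append(f"![{attrs.get('alt') or ''}]({attrs.get('src') or ''})")

    def handle_endtag(self, tag):
        if tag in SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in HEADINGS:
            self._flush(HEADINGS[tag] + " ")
        elif tag == "p":
            self._flush()
        elif tag == "li" and self.list_stack:
            ordered, items = self.list_stack[-1]
            text = self._take_text()
            if text:
                marker = f"{len(items) + 1}. " if ordered else "- "
                items.append(marker + text)
        elif tag in ("ul", "ol") and self.list_stack:
            _, items = self.list_stack.pop()
            self.buf = []
            if items:
                self.blocks.append("\n".join(items))
        elif tag == "pre":
            code = self._take_text()
            self.in_pre = False
            self.blocks.append(FENCE + "\n" + code + "\n" + FENCE)
        elif tag == "a" and self.link is not None:
            href, start = self.link
            self.link = None
            text = "".join(self.buf[start:]).strip()
            del self.buf[start:]
            self.buf.append(f"[{text}]({href})")

    def handle_data(self, data):
        if not self.skip_depth:
            self.buf.append(data)


def html_to_markdown(html, include_images=True):
    parser = MarkdownSketch(include_images)
    parser.feed(html)
    parser.close()
    parser._flush()  # trailing text outside any block element
    return "\n\n".join(parser.blocks)
```

Calling `html_to_markdown("<h1>Title</h1><p>Hello</p>")` returns `"# Title\n\nHello"`. Passing `include_images=False` drops `<img>` tags, which is one possible reading of "optional, configurable".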
Not yet decided how to handle sidebars, navigation blocks, and asides. Options: drop them entirely, append them at the bottom as “Notes,” or let the user configure with include/exclude. Needs discussion.
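To help the discussion, the three options could be modeled roughly as below. This is only a sketch with hypothetical names (`SecondaryPolicy`, `SECONDARY_TAGS`, `arrange`). It assumes the converter yields `(source_tag, markdown_text)` pairs, and that "Notes" would be an `##` heading. None of this is decided.

```python
from enum import Enum


class SecondaryPolicy(Enum):
    DROP = "drop"               # remove secondary blocks entirely
    APPEND_AS_NOTES = "append"  # move them under a trailing "Notes" heading


# Assumed default set of secondary tags; sidebars built from plain <div>s
# would need a different detection rule.
SECONDARY_TAGS = frozenset({"aside", "nav"})


def arrange(blocks, policy=SecondaryPolicy.DROP, exclude=SECONDARY_TAGS):
    """blocks: list of (source_tag, markdown_text) pairs in document order."""
    main = [text for tag, text in blocks if tag not in exclude]
    secondary = [text for tag, text in blocks if tag in exclude]
    if policy is SecondaryPolicy.APPEND_AS_NOTES and secondary:
        main += ["## Notes", *secondary]
    return "\n\n".join(main)
```

Passing an empty `exclude` set keeps every block in place, which would cover the user-configured include case.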
Likely to be implemented as a separate library (e.g. pydoll-markdown-exporter) to keep Pydoll’s core lightweight. Pydoll will call this library internally. A minimal prototype will be released first, covering essential mappings and already useful for RAG/LLM scenarios.
hi @thalissonvs
i'm interested in implementing this, is this already being worked on? i could just add a pydoll_markdown_exporter folder for now that is separate from the main pydoll code
happy to discuss / iterate
Hello @akiusdevo! Sure, you can work on this, it will be really good