
[Feature]: Export HTML to Markdown

Open thalissonvs opened this issue 4 months ago • 2 comments

The main motivation is to make the output more suitable for LLM ingestion, dataset creation, and reproducible text comparisons. Markdown provides a cleaner, more standardized structure compared to raw HTML, which is usually full of layout noise, scripts, and temporary attributes.

Basic behavior

- `<h1>` → `#`, `<h2>` → `##`, `<h3>` → `###`
- `<p>` → simple text line
- `<ul>`/`<ol>` → Markdown lists
- `<a>` → `[text](url)`
- `<img>` → `![alt](src)` (optional, configurable)
- `<pre><code>` → fenced code blocks
- Tables are converted to Markdown tables, with CSV as a fallback
- Inline spans or styling without semantic meaning are discarded
- Scripts, styles, and invisible nodes are ignored
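
The mappings above could be prototyped roughly as follows. This is only a sketch built on the standard library's `html.parser`, not the planned implementation: the names `MarkdownSketch`, `html_to_markdown`, and `include_images` are hypothetical, and tables, nested lists, and invisible-node detection are left out.

```python
from html.parser import HTMLParser

FENCE = "`" * 3  # Markdown code fence, built this way to keep the example readable


class MarkdownSketch(HTMLParser):
    """Hypothetical minimal HTML-to-Markdown converter (not Pydoll's API)."""

    HEADINGS = {"h1": "#", "h2": "##", "h3": "###"}
    SKIPPED = {"script", "style"}

    def __init__(self, include_images=True):
        super().__init__()
        self.include_images = include_images
        self.out = []         # finished Markdown blocks
        self.buf = []         # text fragments of the block being built
        self.skip_depth = 0   # > 0 while inside <script> or <style>
        self.lists = []       # stack of (kind, items) for open <ul>/<ol>
        self.href = None      # target of the <a> currently being read
        self.link_start = 0   # index in buf where the link text starts

    def _take(self):
        # Join buffered text and collapse runs of whitespace.
        text = " ".join("".join(self.buf).split())
        self.buf = []
        return text

    def _flush(self, prefix=""):
        text = self._take()
        if text:
            self.out.append(prefix + text)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in self.SKIPPED:
            self.skip_depth += 1
        elif tag in self.HEADINGS or tag in ("p", "li", "pre"):
            self._flush()
        elif tag in ("ul", "ol"):
            self._flush()
            self.lists.append((tag, []))
        elif tag == "a":
            self.href = attrs.get("href")
            self.link_start = len(self.buf)
        elif tag == "img" and self.include_images:
            self.buf.append(f"![{attrs.get('alt') or ''}]({attrs.get('src') or ''})")
        # Any other tag (span, b, div, ...) is dropped; its text is kept.

    def handle_endtag(self, tag):
        if tag in self.SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in self.HEADINGS:
            self._flush(self.HEADINGS[tag] + " ")
        elif tag == "p":
            self._flush()
        elif tag == "li" and self.lists:
            kind, items = self.lists[-1]
            marker = f"{len(items) + 1}. " if kind == "ol" else "- "
            text = self._take()
            if text:
                items.append(marker + text)
        elif tag in ("ul", "ol") and self.lists:
            _, items = self.lists.pop()
            if items:
                self.out.append("\n".join(items))
        elif tag == "a" and self.href is not None:
            text = "".join(self.buf[self.link_start:])
            self.buf[self.link_start:] = [f"[{text}]({self.href})"]
            self.href = None
        elif tag == "pre":
            # Keep whitespace inside <pre> as-is, apart from outer newlines.
            code = "".join(self.buf).strip("\n")
            self.buf = []
            self.out.append(f"{FENCE}\n{code}\n{FENCE}")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.buf.append(data)


def html_to_markdown(html, include_images=True):
    parser = MarkdownSketch(include_images=include_images)
    parser.feed(html)
    parser.close()
    parser._flush()  # emit trailing text that was not inside a block element
    return "\n\n".join(parser.out)
```

Calling `html_to_markdown(page_html)` on a page's HTML string returns the Markdown blocks separated by blank lines; passing `include_images=False` drops `<img>` elements, matching the "optional, configurable" note in the list.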

How to handle sidebars, navigation blocks, and asides is not yet decided. Options: drop them entirely, append them at the bottom as "Notes," or let the user configure this with include/exclude settings. Needs discussion.
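
To make the options concrete for discussion, here is a rough sketch of how they might look behind a single setting. Everything in it is an assumption for illustration: the `secondary` parameter, its values (`"drop"`, `"notes"`, `"keep"`), the `SecondarySplitter` and `export_text` names, and the choice to treat only `<nav>` and `<aside>` as secondary content. It extracts plain text only, to keep the focus on the configuration question.

```python
from html.parser import HTMLParser

SECONDARY_TAGS = {"nav", "aside"}  # assumed set of "secondary" elements
MODES = ("drop", "notes", "keep")  # hypothetical configuration values


class SecondarySplitter(HTMLParser):
    """Hypothetical helper: separates main text from <nav>/<aside> text."""

    def __init__(self, keep_inline=False):
        super().__init__()
        self.keep_inline = keep_inline
        self.depth = 0        # > 0 while inside a secondary element
        self.main = []
        self.secondary = []

    def handle_starttag(self, tag, attrs):
        if tag in SECONDARY_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in SECONDARY_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        in_secondary = self.depth > 0 and not self.keep_inline
        (self.secondary if in_secondary else self.main).append(data)


def _clean(parts):
    # Join fragments and collapse runs of whitespace.
    return " ".join(" ".join(parts).split())


def export_text(html, secondary="drop"):
    """Return page text; `secondary` picks how nav/aside content is treated."""
    if secondary not in MODES:
        raise ValueError(f"secondary must be one of {MODES}")
    splitter = SecondarySplitter(keep_inline=(secondary == "keep"))
    splitter.feed(html)
    splitter.close()
    text = _clean(splitter.main)
    notes = _clean(splitter.secondary)
    if secondary == "notes" and notes:
        text += "\n\n## Notes\n\n" + notes
    return text
```

With `secondary="drop"` the navigation and aside text disappears, `"notes"` moves it under a trailing "Notes" heading, and `"keep"` leaves it where it appears in the page.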

Likely to be implemented as a separate library (e.g. pydoll-markdown-exporter) to keep Pydoll’s core lightweight. Pydoll will call this library internally. A minimal prototype will be released first, covering essential mappings and already useful for RAG/LLM scenarios.

thalissonvs avatar Aug 22 '25 05:08 thalissonvs

Hi @thalissonvs, I'm interested in implementing this. Is it already being worked on? For now I could just add a pydoll_markdown_exporter folder that is separate from the main pydoll code. Happy to discuss and iterate.

akiusdevo avatar Oct 12 '25 19:10 akiusdevo

Hello @akiusdevo! Sure, you can work on this. It would be really good to have.

thalissonvs avatar Oct 13 '25 23:10 thalissonvs