File support: chm support
Would this work for chm help files?
Two options.
- Convert your chm to html, then pass the html to markitdown: markitdown htmlfromchm.html
- Add a DocumentConverter for chm to the library.
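For the first option, a minimal sketch of the "convert, then pass to markitdown" step might look like the following. It assumes the .chm has already been unpacked into a directory of HTML pages by a separate extraction tool; `convert_extracted_pages` and the `extracted` directory name are hypothetical, and the converter is passed in as a callable so the walker itself does not depend on markitdown.

```python
from pathlib import Path


def convert_extracted_pages(root, convert):
    """Return {relative_path: markdown} for every .htm/.html file under root.

    `convert` is any callable that takes a Path and returns Markdown text.
    """
    root = Path(root)
    results = {}
    for path in sorted(root.rglob("*")):
        # Only HTML topic pages are converted; images, CSS, etc. are skipped.
        if path.is_file() and path.suffix.lower() in {".htm", ".html"}:
            results[path.relative_to(root).as_posix()] = convert(path)
    return results


# Usage with markitdown (assumes `pip install markitdown` and that the CHM
# was unpacked into ./extracted beforehand):
#
#   from markitdown import MarkItDown
#   md = MarkItDown()
#   pages = convert_extracted_pages(
#       "extracted", lambda p: md.convert(str(p)).text_content
#   )
```

Keeping the converter as a parameter also makes it easy to swap in a different HTML-to-Markdown step later.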
Can I contribute to this?
🔧 Problem Summary:
CHM (Compiled HTML Help) files are still widely used in legacy documentation systems and Windows-based software. Currently, MarkItDown doesn't support parsing or converting .chm files into Markdown. Adding support for .chm files would significantly improve the flexibility of this tool when dealing with older help documentation formats.
✅ Proposed Solution:
I propose implementing a module that extracts .chm content, converts the internal HTML pages into Markdown, and feeds them into the standard MarkItDown processing pipeline.
🛠️ Implementation Plan:
Read .chm Files:
Use Python bindings like pychm or wrap a system-level parser like chmlib.
Extract the Table of Contents (TOC), topic files, and internal HTML pages.
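As a rough illustration of the TOC step: once the archive contents are available, the table of contents is usually an .hhc file, which is HTML made of nested lists of sitemap objects carrying Name and Local parameters. The sketch below parses that text with the standard library only; `HhcTocParser` and `parse_hhc` are hypothetical names, and getting the .hhc text out of the archive (via pychm, chmlib, or another tool) is assumed to happen elsewhere.

```python
from html.parser import HTMLParser


class HhcTocParser(HTMLParser):
    """Collect (depth, title, target) entries from .hhc sitemap HTML."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # current <ul> nesting level
        self.entries = []     # list of (depth, title, target)
        self._params = None   # dict while inside a sitemap <object>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "ul":
            self.depth += 1
        elif tag == "object" and attrs.get("type") == "text/sitemap":
            self._params = {}
        elif tag == "param" and self._params is not None:
            key = (attrs.get("name") or "").lower()
            self._params[key] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "ul":
            self.depth = max(0, self.depth - 1)
        elif tag == "object" and self._params is not None:
            self.entries.append(
                (self.depth,
                 self._params.get("name", ""),
                 self._params.get("local", ""))
            )
            self._params = None


def parse_hhc(text):
    """Parse .hhc text and return its TOC entries in document order."""
    parser = HhcTocParser()
    parser.feed(text)
    parser.close()
    return parser.entries
```

The depth value can later drive heading levels or the folder layout of the generated Markdown.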
Convert HTML to Markdown:
Use libraries like html2text or markdownify to convert each HTML page into clean Markdown syntax.
Optionally preserve internal links, headers, and styles in a user-friendly way.
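One way the "preserve internal links" option could work is to rewrite links between topic pages so they point at the generated .md files. The sketch below assumes the HTML has already been converted to Markdown (for example by html2text or markdownify) and only post-processes the resulting text; `rewrite_internal_links` is a hypothetical helper, and it deliberately ignores links that carry a title or a query string.

```python
import re
from urllib.parse import urlsplit

# Matches the "](target)" part of an inline Markdown link without a title.
_LINK = re.compile(r"(\]\()([^)\s]+)(\))")


def rewrite_internal_links(markdown):
    """Point relative .htm/.html links at the corresponding .md files."""

    def fix(match):
        target = match.group(2)
        parts = urlsplit(target)
        # Leave external links (http:, mailto:, ...) and query links alone.
        if parts.scheme or parts.netloc or parts.query:
            return match.group(0)
        path = re.sub(r"\.html?$", ".md", parts.path, flags=re.IGNORECASE)
        fragment = "#" + parts.fragment if parts.fragment else ""
        return match.group(1) + path + fragment + match.group(3)

    return _LINK.sub(fix, markdown)
```

Anchors are kept as they are, so whether they still resolve depends on how the Markdown converter names its headings.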
Integrate with MarkItDown Pipeline:
Add a new CLI input handler for .chm files.
Return the extracted Markdown through MarkItDown's standard output path (stdout or an output file).
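For the input-handler step, the handler first needs a way to decide whether a given input is a CHM file. A minimal sketch, assuming detection by file extension with a fallback to the "ITSF" signature that CHM files start with; `looks_like_chm` is a hypothetical helper and not part of MarkItDown's API.

```python
import io

CHM_MAGIC = b"ITSF"  # signature at the start of a CHM file


def looks_like_chm(file_stream, extension=None):
    """Return True if the extension or the leading bytes indicate a CHM file."""
    if extension and extension.lower() == ".chm":
        return True
    # Peek at the first four bytes, then restore the stream position so a
    # later reader sees the file from where it was before the check.
    position = file_stream.tell()
    try:
        return file_stream.read(4) == CHM_MAGIC
    finally:
        file_stream.seek(position)
```

A check along these lines could back whatever acceptance test the new converter exposes, so files with a missing or wrong extension are still recognized.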
Testing:
Create a small .chm sample file for testing.
Add unit tests for file extraction, HTML parsing, and Markdown output.
Test edge cases: corrupted .chm, multilingual CHM files, or those with heavy formatting.
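The multilingual case mostly comes down to decoding topic pages that use legacy code pages. A hedged sketch of one possible approach: try the charset declared in the page, then a short fallback list, and finally decode with replacement characters. `decode_page` and the fallback order are assumptions made for illustration, not behavior taken from any existing converter.

```python
import re

# Finds a charset declaration such as <meta charset="gb2312"> or
# content="text/html; charset=windows-1252" near the top of the page.
_META_CHARSET = re.compile(rb'charset\s*=\s*["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE)


def decode_page(raw, fallbacks=("utf-8", "cp1252")):
    """Decode raw HTML bytes, preferring the charset the page declares."""
    candidates = []
    match = _META_CHARSET.search(raw[:2048])
    if match:
        candidates.append(match.group(1).decode("ascii"))
    candidates.extend(fallbacks)
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown codec name: try the next candidate.
            continue
    # Last resort: keep going, marking undecodable bytes.
    return raw.decode("utf-8", errors="replace")
```

Unit tests for the multilingual edge case can then feed this function pages encoded in a few different code pages and compare the decoded text.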
Documentation:
Update the README and CLI usage docs to mention .chm support.
Add notes about any library dependencies (like chmlib or pychm).
🧠 Why This Matters:
.chm support bridges the gap between legacy formats and modern Markdown workflows.
It expands MarkItDown’s use cases into software documentation migration, archival research, and technical content revival.
Working with binary help formats shows technical depth and builds resilience in file parsing logic.
📝 Notes:
The implementation can be modular to allow future support for related help formats like .hlp (WinHelp) or .hhp (HTML Help project files).
For large .chm files, consider chunking the content into multiple .md files, one per topic or chapter.
Users should be able to optionally export the entire .chm as a zip archive of .md files.
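The per-topic chunking and zip export could be sketched as follows, assuming the conversion step yields a mapping from each topic's path inside the CHM to its Markdown text. `build_markdown_zip` and that mapping shape are hypothetical.

```python
import io
import zipfile
from pathlib import PurePosixPath


def build_markdown_zip(pages, out):
    """Write one .md entry per topic into a zip archive.

    `pages` maps topic paths (e.g. "/guide/setup.htm") to Markdown text;
    `out` is a file path or a writable binary file object.
    """
    with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as archive:
        for topic_path, markdown in pages.items():
            # Normalize separators, drop the leading slash CHM paths carry,
            # and swap the HTML extension for .md.
            relative = topic_path.replace("\\", "/").lstrip("/")
            name = PurePosixPath(relative).with_suffix(".md")
            archive.writestr(str(name), markdown)
```

Keeping the original folder structure inside the archive means relative links between topics continue to line up after conversion.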
🚀 Ready to Start: If this approach aligns with project goals, I’d love to submit a PR implementing this feature. Open to feedback or alternative suggestions!
I tried implementing this; see PR #1367.