markitdown
Add page-level text extraction for PDF/PPTX/DOCX documents
Summary
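Adds optional page-level extraction to the PDF, PPTX, and DOCX converters through a new extract_pages parameter. When it is set, the result also carries structured per-page data; when it is omitted, behavior is unchanged, so existing callers remain fully backward compatible.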
Motivation
Page-aware applications need to process PDF, PPTX, and DOCX content one page at a time and to know which page each piece of content came from. Additionally, local development settings should not be tracked in version control.
Changes
- New PageInfo class: Stores page number and content
- Enhanced DocumentConverterResult: Added optional pages attribute
- Extended converters: Added extract_pages parameter for page-by-page processing in PDF, PPTX, and DOCX converters
- CLI support: Added --extract-pages and --pages-json flags
- Tests: Test cases covering page extraction for each of the three formats
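The data shape described in the list above could look roughly like the following. This is a minimal sketch, not the code in this PR: the field names page_number and content are taken from the usage example, while ConverterResultSketch, build_result, and the 1-based page numbering are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PageInfo:
    """One page (or slide) of extracted content."""
    page_number: int  # assumed to be 1-based
    content: str


@dataclass
class ConverterResultSketch:
    """Hypothetical stand-in for DocumentConverterResult, not the real class."""
    markdown: str
    pages: Optional[List[PageInfo]] = None  # stays None unless pages were requested


def build_result(page_texts: List[str], extract_pages: bool = False) -> ConverterResultSketch:
    """Join page texts into one document and optionally attach per-page data."""
    full_text = "\n\n".join(page_texts)
    pages = None
    if extract_pages:
        pages = [
            PageInfo(page_number=number, content=text)
            for number, text in enumerate(page_texts, start=1)
        ]
    return ConverterResultSketch(markdown=full_text, pages=pages)
```

Keeping pages optional and defaulting it to None is one way to preserve backward compatibility: callers that never pass extract_pages see the same result object as before.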
Usage
Python API
from markitdown import MarkItDown

md = MarkItDown()

# Traditional (unchanged)
result = md.convert("doc.pdf")

# New page extraction - works for PDF, PPTX, and DOCX
result = md.convert("doc.pdf", extract_pages=True)
result = md.convert("presentation.pptx", extract_pages=True)
result = md.convert("document.docx", extract_pages=True)

# Each entry exposes the page number and that page's content
for page in result.pages:
    print(f"Page {page.page_number}: {page.content}")
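As an illustration of the page-aware use case from the Motivation section, the sketch below finds which pages mention a term. The helper pages_containing is hypothetical and not part of this PR, and the sample data is a stand-in shaped like result.pages (objects with page_number and content attributes) so the snippet runs without any input file.

```python
from types import SimpleNamespace
from typing import Iterable, List


def pages_containing(pages: Iterable, term: str) -> List[int]:
    """Return the page numbers whose content mentions term (case-insensitive)."""
    needle = term.lower()
    return [p.page_number for p in pages if needle in p.content.lower()]


# Stand-in data shaped like result.pages; swap in a real conversion result.
sample_pages = [
    SimpleNamespace(page_number=1, content="Introduction and scope"),
    SimpleNamespace(page_number=2, content="Installation steps"),
    SimpleNamespace(page_number=3, content="Scope of future work"),
]

print(pages_containing(sample_pages, "scope"))  # prints [1, 3]
```

With a real result, the call would be pages_containing(result.pages, "scope"). If pages is left unset when extract_pages is not passed, which the optional attribute suggests, guard against that case before iterating.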
CLI
# Extract pages with JSON output
markitdown doc.pdf --extract-pages --pages-json
markitdown presentation.pptx --extract-pages --pages-json
markitdown document.docx --extract-pages --pages-json
Resolves #210, #122