
Add page-level text extraction for PDF/PPTX/DOCX documents

Open jeonsworld opened this issue 5 months ago • 2 comments

Summary

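Adds optional page-level extraction to the PDF, PPTX, and DOCX converters through a new extract_pages parameter. When it is enabled, the result also carries structured per-page data; when it is omitted, behavior is unchanged, so existing callers remain fully backward compatible.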

Motivation

Users building page-aware applications need to process PDF/PPTX/DOCX pages separately and to know which page each piece of content came from. Additionally, local development settings should not be tracked in version control.

Changes

  • New PageInfo class: Stores page number and content
  • Enhanced DocumentConverterResult: Added optional pages attribute
  • Extended converters: Added extract_pages parameter for page-by-page processing in PDF, PPTX, and DOCX converters
  • CLI support: Added --extract-pages and --pages-json flags
  • Comprehensive tests: Test cases covering all scenarios for each format
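
The data model described in the list above could look roughly like the following. This is a minimal sketch, not the code in this PR: the field names page_number and content are taken from the usage example, while the markdown field, the build_result helper, and the 1-based numbering are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PageInfo:
    """One extracted page: its number and its converted content."""
    page_number: int
    content: str


@dataclass
class DocumentConverterResult:
    """Simplified stand-in for the real result class (assumed shape)."""
    markdown: str
    # Stays None unless page extraction was requested, so existing
    # callers that never look at `pages` are unaffected.
    pages: Optional[List[PageInfo]] = None


def build_result(page_texts: List[str], extract_pages: bool = False) -> DocumentConverterResult:
    """Hypothetical helper: assemble a result from per-page text."""
    markdown = "\n\n".join(page_texts)
    pages = None
    if extract_pages:
        # Assumption: pages are numbered starting at 1.
        pages = [
            PageInfo(page_number=i, content=text)
            for i, text in enumerate(page_texts, start=1)
        ]
    return DocumentConverterResult(markdown=markdown, pages=pages)
```

Keeping pages optional and defaulting it to None is one way to preserve backward compatibility: the combined output is produced either way, and the per-page list only exists when the caller asks for it.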

Usage

Python API

from markitdown import MarkItDown

md = MarkItDown()

# Traditional (unchanged)
result = md.convert("doc.pdf")

# New page extraction - works for PDF, PPTX, and DOCX
result = md.convert("doc.pdf", extract_pages=True)
result = md.convert("presentation.pptx", extract_pages=True)
result = md.convert("document.docx", extract_pages=True)

for page in result.pages:
    print(f"Page {page.page_number}: {page.content}")
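
As an illustration of the page-aware use case from the Motivation section, the sketch below finds which pages mention a term. It relies only on the page_number and content attributes shown in the loop above; the PageLike protocol and the pages_containing helper are hypothetical names, not part of this PR.

```python
from typing import Iterable, List, Optional, Protocol


class PageLike(Protocol):
    """Anything exposing the two attributes used in the usage example."""
    page_number: int
    content: str


def pages_containing(pages: Optional[Iterable[PageLike]], term: str) -> List[int]:
    """Return the page numbers whose content mentions `term` (case-insensitive)."""
    needle = term.lower()
    # `pages` may be None when extraction was not requested; treat that as empty.
    return [p.page_number for p in (pages or []) if needle in p.content.lower()]
```

With the proposed API this might be called as pages_containing(result.pages, "invoice"), assuming result came from a convert call with extract_pages=True.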

CLI

# Extract pages with JSON output
markitdown doc.pdf --extract-pages --pages-json
markitdown presentation.pptx --extract-pages --pages-json
markitdown document.docx --extract-pages --pages-json

Resolves #210 and #122

jeonsworld, May 23 '25 07:05