Extracting complex tables from PDFs
Preflight Checklist
- [x] I have searched existing issues for similar behavior reports
- [x] This report does NOT contain sensitive information (API keys, passwords, etc.)
Type of Behavior Issue
Other unexpected behavior
What You Asked Claude to Do
When working with complex PDF tables—especially engineering specifications, multi-column data sheets, or documents with merged cells and nested headers—Claude Code struggles significantly with traditional text extraction methods.
Why text extraction fails on complex tables:
- pdftotext and similar tools flatten table structure
- Column alignment is lost when cells span multiple lines
- Merged cells become fragmented text
- Headers get mixed with data rows
- Unicode characters and special symbols corrupt parsing

I was spending hours manually copying data or wrestling with Power Query's "Get Data from PDF" feature, which required:

- Manual column delimiter adjustments
- Row-by-row error correction
- Repeated imports for multi-page tables
- Custom M code for edge cases

The Solution: WebP Image Conversion
Instead of fighting with text extraction, convert PDF pages to WebP images and let Claude Code's vision capabilities read the table directly.
The Workflow
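Convert the PDF page(s) to WebP images:
    pdftoppm -webp -r 150 "specs.pdf" "output"
This results in output-1.webp, output-2.webp, etc. If your pdftoppm build rejects -webp, render with -png instead and convert each PNG with cwebp. Then simply ask Claude Code to read the image: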
Read output-1.webp and extract the table data to CSV format
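The conversion step can be scripted. The following is a minimal Python sketch, not a tested pipeline: the function names are hypothetical, and the format flag is a parameter because the poppler releases I know of document -png, -jpeg, and -tiff, so -webp may not be accepted by every build. The sort helper exists because pdftoppm may zero-pad page numbers for longer documents (output-03 rather than output-3), and plain alphabetical order puts output-10 before output-2.

```python
import re
from pathlib import Path


def build_pdftoppm_command(pdf, prefix, dpi=150, first=None, last=None, fmt_flag="-webp"):
    """Return the argument list for one pdftoppm run (nothing is executed here).

    fmt_flag defaults to the flag used in this write-up; swap in "-png"
    if your pdftoppm build does not accept it (assumption: builds differ).
    """
    cmd = ["pdftoppm", fmt_flag, "-r", str(dpi)]
    if first is not None:
        cmd += ["-f", str(first)]  # first page to render
    if last is not None:
        cmd += ["-l", str(last)]   # last page to render
    return cmd + [str(pdf), str(prefix)]


def page_number(path):
    """Extract the trailing page number from names like output-3.webp or output-03.webp."""
    match = re.search(r"-(\d+)$", Path(path).stem)
    if match is None:
        raise ValueError(f"no page number in {path!r}")
    return int(match.group(1))


def pages_in_order(paths):
    """Sort rendered images by numeric page number, not alphabetically."""
    return sorted(paths, key=page_number)
```

To actually run the conversion, pass the returned list to subprocess.run(cmd, check=True), then feed the images to Claude Code in the order given by pages_in_order.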
Why This Works
Claude Code's multimodal capabilities can:
- See the actual visual table structure
- Understand merged cells and spanning headers
- Preserve column relationships
- Handle complex nested layouts
- Process special characters and symbols correctly

In my experience, this was roughly 100x faster than Power Query for complex tables.
Best Practices
Image Quality Settings
150 DPI is usually sufficient for tables:
    pdftoppm -webp -r 150 input.pdf output
For dense or small text, use 200-300 DPI:
    pdftoppm -webp -r 200 input.pdf output
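To get a feel for what these DPI values mean in pixels, here is a small arithmetic sketch. It assumes a US Letter page (612 x 792 PDF points, 72 points per inch); the helper name is hypothetical, and the exact pixel counts a renderer produces may differ by a pixel due to rounding.

```python
POINTS_PER_INCH = 72  # PDF page sizes are expressed in points


def rendered_size(width_pt, height_pt, dpi):
    """Approximate pixel dimensions of a page rendered at the given DPI."""
    return round(width_pt * dpi / POINTS_PER_INCH), round(height_pt * dpi / POINTS_PER_INCH)


LETTER = (612, 792)  # US Letter in points (assumed page size for this example)

for dpi in (150, 200, 300):
    width, height = rendered_size(*LETTER, dpi)
    print(f"{dpi} DPI -> {width} x {height} px")
# 150 DPI -> 1275 x 1650 px
# 200 DPI -> 1700 x 2200 px
# 300 DPI -> 2550 x 3300 px
```

Going from 150 to 300 DPI quadruples the pixel count, so the higher setting is worth reserving for pages where small text is actually unreadable at 150.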
Why WebP over PNG/JPEG?
- Smaller file size than PNG (faster upload/processing)
- Fewer visible compression artifacts on text than JPEG at comparable file sizes
- Claude Code handles WebP natively

For Multi-Page Tables
Convert specific pages only (here, pages 3-5):
    pdftoppm -webp -r 150 -f 3 -l 5 input.pdf output
Then process the resulting images sequentially.
Prompting Tips
Good prompt: Read this image and extract the table to CSV format. The table has headers: Part Number, Size, Dimension A, Dimension B, and Weight. Use commas as delimiters and preserve empty cells.
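Naming the headers in the prompt also makes the reply easy to sanity-check. The sketch below assumes the extracted CSV is available as a string; the function name and the sample rows in the usage note are made up for illustration. It checks only the header names and the cell count per row, not whether the values are correct.

```python
import csv
import io

# Headers named in the prompt above
EXPECTED_HEADERS = ["Part Number", "Size", "Dimension A", "Dimension B", "Weight"]


def check_csv(text, expected_headers=EXPECTED_HEADERS):
    """Return a list of structural problems found in extracted CSV text."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["no rows found"]
    problems = []
    header = [cell.strip() for cell in rows[0]]
    if header != list(expected_headers):
        problems.append(f"unexpected header: {header}")
    # Data rows start on line 2; empty cells still count as cells
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) != len(expected_headers):
            problems.append(
                f"row {line_no}: {len(row)} cells, expected {len(expected_headers)}"
            )
    return problems
```

For example, a row such as A-100,1/2,0.5,,1.2 passes because the empty cell is preserved, while A-101,3/4,0.75 is reported as having 3 cells instead of 5.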
Better prompt (for complex tables): Extract all table data from this engineering specification sheet.
- Maintain the relationship between size designations and their corresponding dimensions
- Convert fractions to decimals
- Include units in a separate column if present
- Output as CSV with headers from the first row

Limitations
- Very large tables (100+ rows) may need to be split across multiple images
- Handwritten annotations won't parse well
- Severely degraded/scanned PDFs may need OCR preprocessing first
- Claude Code context limits apply to extracted data

Tools Required
Install poppler-utils for pdftoppm (Debian/Ubuntu):
    sudo apt install poppler-utils
Verify the installation:
    pdftoppm -v
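If the conversion is wrapped in a script, the same check can run before any pages are rendered. This is a minimal sketch using only the Python standard library; the function name is hypothetical, and it only confirms that the executable is on PATH, not which output formats it supports.

```python
import shutil


def missing_tools(names=("pdftoppm",)):
    """Return the names of required command-line tools that are not on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools found")
```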
Conclusion
If you're spending significant time extracting table data from PDFs using Power Query, Python libraries like tabula-py, or manual copy-paste, try the WebP conversion method with Claude Code. For complex tables with irregular structures, it's not just faster—it's more accurate.
The key insight: treat table extraction as a vision problem, not a text parsing problem.
Tested with Claude Code on engineering specification sheets and manufacturer list price sheets; it has worked every time in cases where Claude struggled before.
What Claude Actually Did
Claude failed and struggled to find a solution when asked to extract complex multi-column tables from PDF documents.
Expected Behavior
Claude should know how to extract data from multi-column tables in PDF documents.
Files Affected
Permission Mode
Accept Edits was ON (auto-accepting changes)
Can You Reproduce This?
Yes, every time with the same prompt
Steps to Reproduce
No response
Claude Model
Sonnet
Relevant Conversation
Impact
Critical - Data loss or corrupted project
Claude Code Version
latest version
Platform
Anthropic API
Additional Context
No response