Extracting complex tables from PDFs

Open rpcsteve opened this issue 2 days ago • 0 comments

Preflight Checklist

[x] I have searched existing issues for similar behavior reports
[x] This report does NOT contain sensitive information (API keys, passwords, etc.)

Type of Behavior Issue

Other unexpected behavior

What You Asked Claude to Do

When working with complex PDF tables—especially engineering specifications, multi-column data sheets, or documents with merged cells and nested headers—Claude Code struggles significantly with traditional text extraction methods.

Why text extraction fails on complex tables:

- pdftotext and similar tools flatten table structure
- Column alignment is lost when cells span multiple lines
- Merged cells become fragmented text
- Headers get mixed with data rows
- Unicode characters and special symbols corrupt parsing

I was spending hours manually copying data or wrestling with Power Query's "Get Data from PDF" feature, which required:

- Manual column delimiter adjustments
- Row-by-row error correction
- Repeated imports for multi-page tables
- Custom M code for edge cases

The Solution: WebP Image Conversion

Instead of fighting with text extraction, convert PDF pages to WebP images and let Claude Code's vision capabilities read the table directly.

The Workflow

Convert PDF page(s) to WebP images:

    pdftoppm -webp -r 150 "specs.pdf" "output"

Results in: output-1.webp, output-2.webp, etc.

Then simply ask Claude Code to read the image:

    Read output-1.webp and extract the table data to CSV format
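If the installed pdftoppm rejects the -webp flag (the raster output flags I can confirm in poppler are -png, -jpeg and -tiff), one possible workaround is to render to PNG and re-encode with cwebp from the libwebp tools. The sketch below only builds the two command lines; the function names render_command and webp_command are hypothetical, and cwebp being installed is an assumption, not something the workflow above states.

```python
from pathlib import Path


def render_command(pdf, prefix, dpi=150, first=None, last=None):
    """Build a pdftoppm argv that renders PDF pages to PNG files."""
    cmd = ["pdftoppm", "-png", "-r", str(dpi)]
    if first is not None:
        cmd += ["-f", str(first)]  # first page to render
    if last is not None:
        cmd += ["-l", str(last)]   # last page to render
    return cmd + [str(pdf), str(prefix)]


def webp_command(png, quality=90):
    """Build a cwebp argv that re-encodes one PNG as a WebP file beside it."""
    png = Path(png)
    webp = png.with_suffix(".webp")
    return ["cwebp", "-q", str(quality), str(png), "-o", str(webp)]
```

Each list can be passed to subprocess.run(..., check=True), for example subprocess.run(render_command("specs.pdf", "output"), check=True), followed by one webp_command call per rendered PNG.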

Why This Works

Claude Code's multimodal capabilities can:

- See the actual visual table structure
- Understand merged cells and spanning headers
- Preserve column relationships
- Handle complex nested layouts
- Process special characters and symbols correctly

This works approximately 100x faster than Power Query for complex tables.

Best Practices

Image Quality Settings

150 DPI is usually sufficient for tables:

    pdftoppm -webp -r 150 input.pdf output

For dense/small text, use 200-300 DPI:

    pdftoppm -webp -r 200 input.pdf output
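To get a feel for what these DPI values mean, it can help to work out the rendered image size and how many pixels tall small text ends up. The sketch below is plain arithmetic; the US Letter page size (8.5 x 11 in) and the 6 pt font size are illustrative assumptions, not values from the documents tested.

```python
def pixel_size(width_in, height_in, dpi):
    """Rendered image size in pixels for a page measured in inches."""
    return round(width_in * dpi), round(height_in * dpi)


def text_height_px(point_size, dpi):
    """Nominal height of text in pixels (1 point = 1/72 inch)."""
    return point_size * dpi / 72


# Assumed example: a US Letter page with 6 pt table text.
letter_150 = pixel_size(8.5, 11, 150)   # (1275, 1650)
letter_300 = pixel_size(8.5, 11, 300)   # (2550, 3300)
small_text_150 = text_height_px(6, 150)  # 12.5 px
small_text_300 = text_height_px(6, 300)  # 25.0 px
```

Doubling the DPI doubles the pixel height of each character but quadruples the pixel count of the page, which is the trade-off behind starting at 150 and only going higher for dense tables.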

Why WebP over PNG/JPEG?

- Smaller file size than PNG (faster upload/processing)
- Better quality than JPEG (no compression artifacts on text)
- Claude Code handles WebP natively

For Multi-Page Tables

Convert specific pages only:

    pdftoppm -webp -r 150 -f 3 -l 5 input.pdf output  # Pages 3-5
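For a table that spans several pages, the images need to be handled in page order and the per-page CSV results stitched back together. A minimal sketch of both steps follows; the helper names are hypothetical, and it assumes each page's CSV repeats the same header row, which may not hold for every document.

```python
import csv
import io
import re


def page_number(filename):
    """Return the trailing page number from a name like 'output-03.webp'."""
    match = re.search(r"-(\d+)\.[A-Za-z0-9]+$", filename)
    if match is None:
        raise ValueError(f"no page number in {filename!r}")
    return int(match.group(1))


def order_pages(filenames):
    """Sort page images numerically, so page 10 comes after page 2."""
    return sorted(filenames, key=page_number)


def merge_csv_chunks(chunks):
    """Concatenate per-page CSV texts, keeping the header row only once."""
    merged = []
    header = None
    for chunk in chunks:
        # Skip blank lines; rows made only of empty cells are kept.
        rows = [row for row in csv.reader(io.StringIO(chunk)) if row]
        if not rows:
            continue
        if header is None:
            header = rows[0]
            merged.append(header)
            rows = rows[1:]
        elif rows[0] == header:
            rows = rows[1:]  # drop the repeated header on later pages
        merged.extend(rows)
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(merged)
    return out.getvalue()
```

A plain alphabetical sort would put output-10 before output-2, which is why the page number is parsed out explicitly.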

Then process sequentially.

Prompting Tips

Good prompt: Read this image and extract the table to CSV format. The table has headers: Part Number, Size, Dimension A, Dimension B, and Weight. Use commas as delimiters and preserve empty cells.
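Because the prompt names the headers up front, the returned CSV can be checked mechanically before it is trusted. The sketch below is one hedged way to do that; check_csv is a hypothetical helper, and the header list is simply copied from the prompt above.

```python
import csv
import io

# Headers taken from the example prompt.
EXPECTED_HEADERS = ["Part Number", "Size", "Dimension A", "Dimension B", "Weight"]


def check_csv(text, expected_headers=EXPECTED_HEADERS):
    """Return a list of problems found in the CSV text (empty if none)."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["no rows found"]
    problems = []
    header = [cell.strip() for cell in rows[0]]
    if header != list(expected_headers):
        problems.append(f"unexpected headers: {header}")
    # Data rows start on line 2; empty cells still count as cells.
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) != len(expected_headers):
            problems.append(
                f"row {line_no}: expected {len(expected_headers)} cells, got {len(row)}"
            )
    return problems
```

A row with a missing value such as P-100,2,1.5,,0.8 passes, since the empty cell is preserved, while a row that lost columns is reported with its line number.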

Better prompt (for complex tables): Extract all table data from this engineering specification sheet.

- Maintain the relationship between size designations and their corresponding dimensions
- Convert fractions to decimals
- Include units in a separate column if present
- Output as CSV with headers from the first row

Limitations

- Very large tables (100+ rows) may need to be split across multiple images
- Handwritten annotations won't parse well
- Severely degraded/scanned PDFs may need OCR preprocessing first
- Claude Code context limits apply to extracted data

Tools Required

Install poppler-utils for pdftoppm:

    sudo apt install poppler-utils

Verify installation:

    pdftoppm -v
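When the conversion is driven from a script, the same check can be done programmatically before any pages are rendered. A small sketch, assuming only that the tool is expected on PATH; missing_tools is a hypothetical helper name.

```python
import shutil


def missing_tools(tools=("pdftoppm",)):
    """Return the names of required command-line tools not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

An empty list means everything was found; otherwise the script can stop early and name what needs installing.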

Conclusion

If you're spending significant time extracting table data from PDFs using Power Query, Python libraries like tabula-py, or manual copy-paste, try the WebP conversion method with Claude Code. For complex tables with irregular structures, it's not just faster—it's more accurate.

The key insight: treat table extraction as a vision problem, not a text parsing problem.

Tested with Claude Code on engineering specification sheets and manufacturer list price sheets; it has worked every time in cases where Claude struggled before.

What Claude Actually Did

Claude failed and struggled to find a solution when asked to extract complex multi-column tables from PDF documents.

Expected Behavior

Claude should know how to extract data from multi-column tables

Files Affected

Permission Mode

Accept Edits was ON (auto-accepting changes)

Can You Reproduce This?

Yes, every time with the same prompt

Steps to Reproduce

No response

Claude Model

Sonnet

Relevant Conversation

Impact

Critical - Data loss or corrupted project

Claude Code Version

latest version

Platform

Anthropic API

Additional Context

No response

Dec 27 '25 18:12 rpcsteve