Image Extraction incomplete

Open neverset123 opened this issue 10 months ago • 2 comments

Describe the bug

A single figure in the PDF is detected as multiple separate images.

Have you tried repairing the PDF?

Yes. The code below opens the PDF with pdfplumber.open(..., repair=True).

Code to reproduce the problem

```python
import io

import requests
import pdfplumber as pp

SOURCE = 'https://arxiv.org/pdf/2502.13897v1'

response = requests.get(SOURCE)
doc = pp.open(io.BytesIO(response.content), repair=True)
page = doc.pages[2]
image = page.images[0]

# Crop the page to the first image's bounding box and render the crop.
page.crop((
    image['x0'], image['top'], image['x1'], image['bottom']
)).to_image(resolution=300).save('img.jpg')
```

PDF file

https://arxiv.org/pdf/2502.13897v1

Expected behavior

The figure should be extracted as one complete image.

Actual behavior

The figure is split into multiple incomplete images.

Environment

  • pdfplumber version: latest
  • Python version: 3.10
  • OS: Ubuntu

neverset123, Feb 23 '25 18:02

Running your code produces this image:

Image

... which seems correct. What seems incorrect about it to you? The page includes 17 images; it may look like a complete figure to the human eye, but appears to be composed differently in the PDF.
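If the separate images on the page all belong to one figure, one rough workaround is to crop to the union of their bounding boxes instead of to `page.images[0]` alone. The sketch below is only an illustration: `union_bbox` and the `fragments` data are hypothetical, and it assumes every dict passed in is part of the same figure, which will not hold for a page with several figures.

```python
def union_bbox(images):
    """Return the smallest (x0, top, x1, bottom) box covering every image dict."""
    if not images:
        raise ValueError("no images to merge")
    return (
        min(im["x0"] for im in images),
        min(im["top"] for im in images),
        max(im["x1"] for im in images),
        max(im["bottom"] for im in images),
    )


# Two made-up fragments of one figure, using the same keys as pdfplumber's
# page.images entries (x0, top, x1, bottom).
fragments = [
    {"x0": 72.0, "top": 100.0, "x1": 200.0, "bottom": 180.0},
    {"x0": 200.0, "top": 100.0, "x1": 330.0, "bottom": 260.0},
]
figure_box = union_bbox(fragments)
```

With a real page this would be used as `page.crop(union_bbox(page.images)).to_image(resolution=300)`. If a box extends past the page edge, `page.crop` raises an error by default, so passing `strict=False` or clamping the box to `page.bbox` may be needed.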

jsvine, Mar 28 '25 03:03

To second @jsvine's comment: an "image" in a PDF is very often not what you think it is, since PDF readers are also compositors. The only reliable way to extract a figure from a PDF is to do visual layout analysis and then render the PDF, cropping the page to the desired bounding box. You can do this with Docling, for instance.
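As a very crude stand-in for real layout analysis, nearby image boxes can be merged into candidate figure regions before rendering. This is a heuristic sketch, not what Docling or any layout model does: `boxes_touch`, `merge`, `group_boxes`, and the 5-point `gap` are all assumptions made for illustration, and figures built from vector drawings or text would be missed entirely.

```python
def boxes_touch(a, b, gap=5.0):
    """True if boxes a and b (x0, top, x1, bottom) overlap or lie within `gap` points."""
    return (
        a[0] <= b[2] + gap and b[0] <= a[2] + gap
        and a[1] <= b[3] + gap and b[1] <= a[3] + gap
    )


def merge(a, b):
    """Return the smallest box covering both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def group_boxes(boxes, gap=5.0):
    """Repeatedly merge touching boxes until no pair can be merged."""
    groups = list(boxes)
    merged = True
    while merged:
        merged = False
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                if boxes_touch(groups[i], groups[j], gap):
                    groups[i] = merge(groups[i], groups[j])
                    del groups[j]
                    merged = True
                    break
            if merged:
                break
    return groups


# Made-up boxes: two adjacent fragments and one distant image.
regions = group_boxes([(0, 0, 10, 10), (12, 0, 20, 10), (100, 100, 110, 110)])
```

Each resulting region could then be rendered with `page.crop(region).to_image(resolution=300)`, where the input boxes would come from `(im["x0"], im["top"], im["x1"], im["bottom"])` for each entry in `page.images`.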

dhdaines, Apr 21 '25 13:04