Image Extraction incomplete
Describe the bug
A single figure in the PDF is detected as multiple images.
Have you tried repairing the PDF?
yes
Please try running your code with pdfplumber.open(..., repair=True) before submitting a bug report.
Code to reproduce the problem
```python
import io

import requests
import pdfplumber as pp

SOURCE = 'https://arxiv.org/pdf/2502.13897v1'

# Download the PDF and open it from memory, with repair enabled
response = requests.get(SOURCE)
doc = pp.open(io.BytesIO(response.content), repair=True)
page = doc.pages[2]
image = page.images[0]

# Crop the page to the first image's bounding box and save the render
page.crop((
    image['x0'], image['top'], image['x1'], image['bottom']
)).to_image(resolution=300).save('img.jpg')
```
PDF file
https://arxiv.org/pdf/2502.13897v1
Expected behavior
The figure should be extracted as one complete image.
Actual behavior
The figure is split into multiple incomplete images.
Environment
- pdfplumber version: latest
- Python version: 3.10
- OS: Ubuntu
Running your code produces this image:
... which seems correct. What seems incorrect about it to you? The page includes 17 images; it may look like a complete figure to the human eye, but appears to be composed differently in the PDF.
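One way to work with a figure that is stored as many image fragments is to merge the fragments' bounding boxes and crop the page to the result. The sketch below is only a heuristic: it assumes every entry in `page.images` belongs to the same figure, which will not hold on pages with several figures. The helper names `union_bbox` and `save_merged_figure` are hypothetical and not part of pdfplumber; they rely only on the `x0`, `top`, `x1`, `bottom` keys and the `crop(...).to_image(...).save(...)` chain already used in the report.

```python
def union_bbox(objs):
    """Return the smallest (x0, top, x1, bottom) box enclosing all objects.

    `objs` is any iterable of pdfplumber-style dicts with the keys
    'x0', 'top', 'x1' and 'bottom' (for example, `page.images`).
    """
    objs = list(objs)
    if not objs:
        raise ValueError("no objects to merge")
    return (
        min(o["x0"] for o in objs),
        min(o["top"] for o in objs),
        max(o["x1"] for o in objs),
        max(o["bottom"] for o in objs),
    )


def save_merged_figure(page, path, resolution=300):
    """Crop `page` to the union of its image fragments and save the render.

    Assumption: all image fragments on the page form one figure.
    """
    bbox = union_bbox(page.images)
    page.crop(bbox).to_image(resolution=resolution).save(path)
    return bbox
```

With the page from the report, `save_merged_figure(page, 'figure.jpg')` would render the area covered by all of the fragments together. Any vector drawing or text that belongs to the figure but lies outside the image fragments would still be cut off.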
To second @jsvine's comment: an "image" in a PDF is very often not what you think it is, since PDF readers are also compositors. The only reliable way to extract a figure from a PDF is to run visual layout analysis, then render the page and crop it to the desired bounding box. You can do this with Docling, for instance.
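The render-and-crop step can be sketched with pdfplumber alone, once a bounding box is available. This sketch assumes the box comes from some layout-analysis tool and has already been converted to pdfplumber's coordinates (PDF points, origin at the top-left of the page); that conversion is not shown. `clamp_bbox` and `render_region` are hypothetical helper names. The clamping is there because a detected box can extend slightly past the page edge, and cropping to a box outside the page raises an error by default.

```python
def clamp_bbox(bbox, page_bbox):
    """Clip (x0, top, x1, bottom) so that it lies within `page_bbox`."""
    x0, top, x1, bottom = bbox
    px0, ptop, px1, pbottom = page_bbox
    clamped = (
        max(x0, px0),
        max(top, ptop),
        min(x1, px1),
        min(bottom, pbottom),
    )
    # An empty or inverted box means the region misses the page entirely
    if clamped[0] >= clamped[2] or clamped[1] >= clamped[3]:
        raise ValueError("bounding box does not overlap the page")
    return clamped


def render_region(page, bbox, path, resolution=300):
    """Render the part of `page` inside `bbox` and save it to `path`."""
    region = clamp_bbox(bbox, page.bbox)
    page.crop(region).to_image(resolution=resolution).save(path)
    return region
```

For example, `render_region(doc.pages[2], figure_bbox, 'figure.jpg')`, where `figure_bbox` stands for a box produced by the layout analysis.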