
Trouble with image extraction

Open rerik opened this issue 1 year ago • 5 comments

Describe the bug

It's a two-in-one problem.

First, the raw image data (for example, doc.pages[1].images[0]['stream'].rawdata) is broken. Opening it with PIL, PIL.Image.open(io.BytesIO(doc.pages[1].images[0]['stream'].rawdata)), raises UnidentifiedImageError('cannot identify image file <_io.BytesIO object at 0x7f62058f5df0>'). If I save the image bytes directly, the resulting file is broken and cannot be opened.

with open('image.jpg', 'wb') as file:
    file.write(doc.pages[1].images[0]['stream'].rawdata)

I tried getting the raw bytes of this image with the pypdf library. It returns about twice as many bytes, and they can easily be saved, so this is not a fundamental problem with the image itself.

Second, if I try to save a crop of the page using this image's bbox, the crop misses the image.
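
One way to check what the bytes actually are before writing them to a .jpg file is to look at their first few bytes. A minimal sketch using only the standard library; the helper name sniff_image_format is hypothetical and is not part of pdfplumber:

```python
from typing import Optional

# Well-known file signatures ("magic numbers") for two common image formats.
JPEG_MAGIC = b"\xff\xd8\xff"
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"


def sniff_image_format(data: bytes) -> Optional[str]:
    """Return 'jpeg' or 'png' if data starts with that signature, else None."""
    if data.startswith(JPEG_MAGIC):
        return "jpeg"
    if data.startswith(PNG_MAGIC):
        return "png"
    return None
```

Calling sniff_image_format(doc.pages[1].images[0]['stream'].rawdata) and getting None would suggest the bytes are not a standalone image file (for example, still-compressed pixel data), which would be consistent with the PIL error above.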

doc.pages[1].crop((
    doc.pages[1].images[0]['x0'],
    doc.pages[1].images[0]['top'], 
    doc.pages[1].images[0]['x1'], 
    doc.pages[1].images[0]['bottom']
)).to_image(resolution=300).save('img.jpg')

This code saves img_5 instead of img_4.

Have you tried repairing the PDF?

Yes, I've tried. In that case it simply crashes on opening:

Traceback (most recent call last):
  File "/home/alex/AlanNLP/qna/test.py/test_new_pdf_parser.py", line 60, in <module>
    result = parse(io.BytesIO(response.content), images_dir, images_url, SOURCE, pages_cache_file=PAGES_CACHE, print_progress=True)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/alex/AlanNLP/qna/src.py/pdf_parser_new/parser.py", line 397, in parse
    doc = pp.open(file, repair=True)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/alex/AlanNLP/venv/3.11/lib/python3.11/site-packages/pdfplumber/pdf.py", line 84, in open
    stream = _repair(
             ^^^^^^^^
  File "/home/alex/AlanNLP/venv/3.11/lib/python3.11/site-packages/pdfplumber/repair.py", line 58, in _repair
    raise Exception(f"{stderr.decode('utf-8')}")
Exception: GPL Ghostscript 9.50 (2019-10-15)
Copyright (C) 2019 Artifex Software, Inc.  All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
   **** Error: Incorrect object count in object stream.
               Output may be incorrect.
Error: /rangecheck in resolveobjectstream
Operand stack:
   (/tmp/gs_ujCNhG)   --nostringval--   --dict:1/100(L)--   2511   4207859   13   2511   3645   --dict:8/15(L)--   150   --nostringval--   163   --nostringval--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:2/2(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:7/7(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:8/8(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:3/3(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:8/8(L)--   --dict:3/3(L)--   --dict:8/8(L)--   --dict:8/8(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:8/8(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:8/8(L)--   --dict:8/8(L)--   --dict:4/4(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:8/8(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:8/8(L)--   --dict:4/4(L)--   --dict:4/4(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:7/7(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:8/8(L)--   --dict:3/3(L)--   
--dict:8/8(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:8/8(L)--   --dict:4/4(L)--   --dict:4/4(L)--   --dict:8/8(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:8/8(L)--   --dict:8/8(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:8/8(L)--   --dict:4/4(L)--   --dict:4/4(L)--   --dict:3/3(L)--   --dict:7/7(L)--   --dict:4/4(L)--   --dict:4/4(L)--   --dict:8/8(L)--   --dict:8/8(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:8/8(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:8/8(L)--   --dict:3/3(L)--   --dict:4/4(L)--   --dict:3/3(L)--   --dict:8/8(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:8/8(L)--   --dict:8/8(L)--   --dict:5/5(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:4/4(L)--   --dict:5/5(L)--   --dict:4/4(L)--
Execution stack:
   %interp_exit   .runexec2   --nostringval--   runpdf   --nostringval--   2   %stopped_push   --nostringval--   runpdf   runpdf   false   1   %stopped_push   1990   1   3   %oparray_pop   1989   1   3   %oparray_pop   1977   1   3   %oparray_pop   runpdf   1978   3   3   %oparray_pop   runpdf   runpdf   runpdf   runpdf   runpdf   runpdf   runpdf   runpdf
Dictionary stack:
   --dict:731/1123(ro)(G)--   --dict:1/20(G)--   --dict:80/200(L)--   --dict:80/200(L)--   --dict:135/256(ro)(G)--   --dict:315/325(ro)(G)--   --dict:29/32(L)--
Current allocation mode is local
GPL Ghostscript 9.50: Unrecoverable error, exit code 1

Environment

  • pdfplumber version: 0.11.4
  • Python version: 3.11.9
  • OS: Ubuntu 20.04.6 LTS on Windows 10 x86_64

rerik avatar Sep 26 '24 16:09 rerik

Hi @rerik, and thanks for your interest in pdfplumber. Can you share the PDF and a minimal Python script that reproduces the problem?

jsvine avatar Oct 03 '24 02:10 jsvine

Oh, I'm sorry, my bad. I was absolutely sure I had given the link to the target file: https://storage.googleapis.com/alan-ai-knowledge-base/isu-knowledge-base/Book/fundamentals-book-1-6.pdf

Minimal Python script to reproduce:

import io
import requests

import pdfplumber as pp


SOURCE = 'https://storage.googleapis.com/alan-ai-knowledge-base/isu-knowledge-base/Book/fundamentals-book-1-6.pdf'

response = requests.get(SOURCE)
doc = pp.open(io.BytesIO(response.content))
page = doc.pages[1]
image = page.images[0]

page.crop((
    image['x0'],
    image['top'], 
    image['x1'], 
    image['bottom']
)).to_image(resolution=300).save('img.jpg')

rerik avatar Oct 03 '24 10:10 rerik

Thank you, this is very helpful. I can reproduce the issue, and will see if I can find a solution.

jsvine avatar Oct 03 '24 12:10 jsvine

I have a similar problem when I try to read the stream using PIL.

import io

import pdfplumber
from PIL import Image

with pdfplumber.open("example.pdf") as pdf:
    # print(pdf.pages)
    page = pdf.pages[0]  # Extract the first page
    for page in pdf.pages:
        positions = []
        for im in page.images:
            p = (im['x0'], im['top'], im['x1'], im['bottom'])
            print(im)
            image_data = im['stream'].get_data()
            pil_image = Image.open(io.BytesIO(image_data))
            positions.append(p)
        print('positions', positions)

pil_image = Image.open(io.BytesIO(image_data))
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

PIL.UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x13ff37380>

This is the image that caused the problem: {'x0': 12.0, 'y0': 28.32494, 'x1': 780.0, 'y1': 583.67504, 'width': 768.0, 'height': 555.3501, 'stream': <PDFStream(75): raw=241206, {'BitsPerComponent': 8, 'ColorSpace': PDFObjRef:76, 'Filter': /'FlateDecode', 'Height': 632, 'Interpolate': True, 'Length': 241206, 'Subtype': /'Image', 'Type': /'XObject', 'Width': 874}>, 'srcsize': (874, 632), 'imagemask': None, 'bits': 8, 'colorspace': [[/'ICCBased', <PDFStream(77): raw=3172, {'Alternate': /'DeviceRGB', 'Filter': /'FlateDecode', 'Length': 3172, 'N': 3}>]], 'mcid': None, 'tag': None, 'object_type': 'image', 'page_number': 8, 'top': 28.324960000000033, 'bottom': 583.67506, 'doctop': 4412.32496}

It looks like all PDF streams with 'Filter': /'FlateDecode' have this problem. I may be wrong, but all streams in my test case with 'Filter': /'DCTDecode' are fine.
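
This is consistent with how the two filters differ: a DCTDecode stream holds JPEG data, while a FlateDecode stream is zlib-compressed pixel data with no image file header, so PIL.Image.open() has nothing to identify. A minimal sketch with synthetic data standing in for a real stream; the helper name inflate_pixels is hypothetical, and it assumes 8 bits per component and no predictor in DecodeParms:

```python
import zlib


def inflate_pixels(rawdata: bytes, width: int, height: int, components: int) -> bytes:
    """Decompress FlateDecode data and check it has the expected raw-pixel size."""
    pixels = zlib.decompress(rawdata)
    expected = width * height * components  # one byte per component (assumption)
    if len(pixels) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(pixels)}")
    return pixels


# Synthetic 4x2 RGB image standing in for a real stream's rawdata.
fake_raw = zlib.compress(bytes(4 * 2 * 3))
pixels = inflate_pixels(fake_raw, width=4, height=2, components=3)
```

Under the same assumptions, the 874 x 632 three-component image quoted above would decode to 874 * 632 * 3 = 1,657,104 bytes of raw pixels.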

JiachengSun0520 avatar Dec 06 '24 18:12 JiachengSun0520

Solved it by using PIL.Image.frombytes().
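
The comment does not show the code, so the following is only a sketch of how PIL.Image.frombytes() might be applied here. It assumes 8 bits per component, decoded bytes from im['stream'].get_data(), and the pixel size from im['srcsize']. The helper names are hypothetical, and the mode is guessed from the data length rather than read from the colour space:

```python
from typing import Tuple


def guess_mode(data: bytes, size: Tuple[int, int]) -> str:
    """Guess a PIL mode from the byte count, assuming 8 bits per component."""
    width, height = size
    if width <= 0 or height <= 0:
        raise ValueError("image size must be positive")
    modes = {1: "L", 3: "RGB", 4: "CMYK"}
    components, remainder = divmod(len(data), width * height)
    if remainder or components not in modes:
        raise ValueError("data length does not match an 8-bit L/RGB/CMYK image")
    return modes[components]


def pixels_to_image(data: bytes, size: Tuple[int, int]):
    """Build a PIL image from raw pixel bytes (requires Pillow)."""
    from PIL import Image  # imported here so guess_mode works without Pillow

    return Image.frombytes(guess_mode(data, size), size, data)
```

With the loop from the earlier comment, this could be called as pixels_to_image(im['stream'].get_data(), im['srcsize']).save('image.png'). Indexed colour spaces and other bit depths are not handled by this sketch.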

JiachengSun0520 avatar Dec 09 '24 19:12 JiachengSun0520