
page.to_image() PDFium: Data format error

Open dalinautoagents opened this issue 1 year ago • 4 comments

Describe the bug

A clear and concise description of what the bug is.

Have you tried repairing the PDF?

Please try running your code with pdfplumber.open(..., repair=True) before submitting a bug report.

Code to reproduce the problem

Paste it here, or attach a Python file.

PDF file

Please attach any PDFs necessary to reproduce the problem.

If you need to redact text in a sensitive PDF, you can run it through JoshData/pdf-redactor.

Expected behavior

What did you expect the result should have been?

Actual behavior

What actually happened, instead?

Screenshots

If applicable, add screenshots to help explain your problem.

Environment

  • pdfplumber==0.10.4
  • Python version: [e.g., 3.11.0]
  • OS: docker

Additional context

Add any other context/notes about the problem here.

It's easy to reproduce: take two large PDFs and run this code:

self.pdf = pdfplumber.open(fnm) if isinstance(fnm, str) else pdfplumber.open(BytesIO(fnm))

self.page_images = [p.to_image(resolution=72 * zoomin).annotated for i, p in enumerate(self.pdf.pages[page_from:page_to])]

I think there's a concurrency issue with `to_image`.

Update: when I add a lock, it works fine.

dalinautoagents • Jul 30 '24 13:07

Thank you for raising this issue. Please try updating to the latest version of pdfplumber. Do you still encounter the problem? If so, can you share a fully-reproducible script?

jsvine • Jul 31 '24 22:07

@jwilk
pdfplumber==0.11.1

Without a lock, when I run this on two large files at the same time, I get these errors:

try:
    self.pdf = pdfplumber.open(fnm) if isinstance(
        fnm, str) else pdfplumber.open(BytesIO(fnm))
    self.page_images = [p.to_image(resolution=72 * zoomin).annotated for i, p in
                        enumerate(self.pdf.pages[page_from:page_to])]
    self.page_chars = [[c for c in page.chars if self._has_color(c)] for page in
                       self.pdf.pages[page_from:page_to]]
    self.total_page = len(self.pdf.pages)
except Exception as e:
    traceback.print_exc()
    logging.error(str(e))

When I add a lock, it works fine:

try:
    lock.acquire()
    self.pdf = pdfplumber.open(fnm) if isinstance(
        fnm, str) else pdfplumber.open(BytesIO(fnm))
    self.page_images = [p.to_image(resolution=72 * zoomin).annotated for i, p in
                        enumerate(self.pdf.pages[page_from:page_to])]
    self.page_chars = [[c for c in page.chars if self._has_color(c)] for page in
                       self.pdf.pages[page_from:page_to]]
    self.total_page = len(self.pdf.pages)
except Exception as e:
    traceback.print_exc()
    logging.error(str(e))
finally:
    lock.release()

But with the lock, efficiency is too low.

dalinautoagents • Jul 31 '24 23:07

@jsvine could you give me some ideas for fixing this? I don't know what I can do to improve efficiency.

dalinautoagents • Aug 01 '24 01:08
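
On the efficiency question, one possible direction is to hold the lock only around the rendering call instead of around the whole block. This is a sketch, not a confirmed fix. It rests on the assumption that only `to_image` reaches PDFium (pypdfium2's documentation describes PDFium as not thread-safe), while opening the file and reading `page.chars` do not. The names `_render_lock`, `render_page`, and `load_pdf` are hypothetical, the `zoomin` default is arbitrary, and the `_has_color` filter from the snippets above is left out.

```python
import threading
from io import BytesIO

# Hypothetical module-level lock; it guards only the PDFium-backed render call.
_render_lock = threading.Lock()


def render_page(page, resolution, lock=_render_lock):
    """Render one page while holding the lock and return the annotated image."""
    with lock:  # released automatically, even if rendering raises
        return page.to_image(resolution=resolution).annotated


def load_pdf(fnm, page_from=0, page_to=None, zoomin=3):
    """Open a PDF from a path or bytes; only the rendering step is serialized."""
    import pdfplumber  # imported lazily so render_page has no hard dependency

    source = fnm if isinstance(fnm, str) else BytesIO(fnm)
    with pdfplumber.open(source) as pdf:
        pages = pdf.pages[page_from:page_to]
        # Each render takes the lock briefly, so other threads can interleave.
        images = [render_page(p, 72 * zoomin) for p in pages]
        # Assumption: character extraction does not call PDFium, so it runs unlocked.
        chars = [page.chars for page in pages]
        return images, chars, len(pdf.pages)
```

Whether this helps depends on how much of the runtime is rendering. If rendering dominates, running each file in a separate process (for example with `concurrent.futures.ProcessPoolExecutor`) may be worth trying instead, since each process would load its own copy of the library.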

Hi @dalinautoagents, those code snippets reference external unstated variables and also combine image-related processing with other logic, creating an obstacle to reproduction. Could you create a simplified Python script that can be run directly and reproduces the error you're seeing?

jsvine • Aug 02 '24 19:08
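
A standalone script of the kind requested might look roughly like the following. It is only a sketch under assumptions: the PDF paths come from the command line, the resolution of 216 (72 * 3) is an arbitrary choice because the original `zoomin` value is not stated, and `render_all` and `run_concurrently` are hypothetical helper names. It does nothing but open each file and render its pages, one thread per file, so that any failure can be attributed to `to_image` alone.

```python
import sys
from concurrent.futures import ThreadPoolExecutor


def render_all(path, resolution=216):
    """Open one PDF, render every page, and return the number of pages."""
    import pdfplumber  # imported lazily; the threading helper below is generic

    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            page.to_image(resolution=resolution)
        return len(pdf.pages)


def run_concurrently(worker, items):
    """Run worker(item) with one thread per item; return (results, errors)."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max(1, len(items))) as pool:
        futures = {pool.submit(worker, item): item for item in items}
        for future, item in futures.items():
            try:
                results[item] = future.result()
            except Exception as exc:  # keep going so every failure is reported
                errors[item] = exc
    return results, errors


if __name__ == "__main__":
    paths = sys.argv[1:]  # e.g. two large PDFs
    results, errors = run_concurrently(render_all, paths)
    for path, count in results.items():
        print(f"{path}: rendered {count} pages")
    for path, exc in errors.items():
        print(f"{path}: failed with {exc!r}")
```

Running it with two large files as arguments, and then again with a single file, would show whether the error appears only when renders overlap.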