page.to_image() PDFium: Data format error
Environment
- pdfplumber==0.10.4
- Python version: [e.g., 3.11.0]
- OS: docker
Additional context
It's easy to reproduce: take two large PDFs and run this code:
self.pdf = pdfplumber.open(fnm) if isinstance( fnm, str) else pdfplumber.open(BytesIO(fnm))
self.page_images = [p.to_image(resolution=72 * zoomin).annotated for i, p in enumerate(self.pdf.pages[page_from:page_to])]
I think there's a concurrency issue with 'to_image'.
Update: when I add a lock, it works fine.
Thank you for raising this issue. Please try updating to the latest version of pdfplumber. Do you still encounter the problem? If so, can you share a fully-reproducible script?
@jwilk
I'm now on pdfplumber==0.11.1.
Without a lock, when I run this at the same time on two large files, I still get the error:
try:
    self.pdf = pdfplumber.open(fnm) if isinstance(
        fnm, str) else pdfplumber.open(BytesIO(fnm))
    self.page_images = [p.to_image(resolution=72 * zoomin).annotated for i, p in
                        enumerate(self.pdf.pages[page_from:page_to])]
    self.page_chars = [[c for c in page.chars if self._has_color(c)] for page in
                       self.pdf.pages[page_from:page_to]]
    self.total_page = len(self.pdf.pages)
except Exception as e:
    traceback.print_exc()
    logging.error(str(e))
When I add a lock, it works fine:
try:
    lock.acquire()
    self.pdf = pdfplumber.open(fnm) if isinstance(
        fnm, str) else pdfplumber.open(BytesIO(fnm))
    self.page_images = [p.to_image(resolution=72 * zoomin).annotated for i, p in
                        enumerate(self.pdf.pages[page_from:page_to])]
    self.page_chars = [[c for c in page.chars if self._has_color(c)] for page in
                       self.pdf.pages[page_from:page_to]]
    self.total_page = len(self.pdf.pages)
except Exception as e:
    traceback.print_exc()
    logging.error(str(e))
finally:
    lock.release()
But with the lock, throughput is too low.
@jsvine could you give me some ideas for fixing this? I don't know what I can do to improve efficiency.
Hi @dalinautoagents, those code snippets reference external unstated variables and also combine image-related processing with other logic, creating an obstacle to reproduction. Could you create a simplified Python script that can be run directly and reproduces the error you're seeing?
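For reference, a standalone reproduction along those lines might look like the sketch below. It is not taken from the reporter's code. The file names a.pdf and b.pdf are placeholders, the zoom factor of 3 is an assumption (the value of zoomin is not given above), and run_concurrently and render_all are hypothetical helper names.

```python
import threading


def run_concurrently(worker, inputs):
    """Run worker(item) in one thread per item and return the exceptions raised."""
    errors = []
    errors_lock = threading.Lock()

    def target(item):
        try:
            worker(item)
        except Exception as exc:
            # Collect the failure so it is reported after all threads finish.
            with errors_lock:
                errors.append((item, exc))

    threads = [threading.Thread(target=target, args=(item,)) for item in inputs]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return errors


def render_all(path, resolution=72 * 3):
    """Open one PDF and render every page, with no lock."""
    import pdfplumber  # imported here so the harness itself has no hard dependency

    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            page.to_image(resolution=resolution)


if __name__ == "__main__":
    # Placeholder paths: substitute the two large PDFs that trigger the error.
    for item, exc in run_concurrently(render_all, ["a.pdf", "b.pdf"]):
        print(item, repr(exc))
```

A script of this shape has no external state, so it can be run directly once the two PDFs are attached.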