pdfplumber When I set repair=true,there is an error:'utf-8' codec can't decode byte 0xae in position 239: invalid start byte.Because of the original PDF?

When I set repair=true,there is an error:'utf-8' codec can't decode byte 0xae in position 239: invalid start byte.Because of the original PDF?

Open zyc1128 opened this issue 8 months ago • 1 comments

Describe the bug

A clear and concise description of what the bug is.

And When I use pages.page.char[x]["text"] to get contens by single char,some texts from tables have been lost.I also find there is no bytes_like of the key of image object,how can I save images in the PDF to local?

Have you tried repairing the PDF?

Please try running your code with pdfplumber.open(..., repair=True) before submitting a bug report.

Code to reproduce the problem

Paste it here, or attach a Python file.

PDF file

Please attach any PDFs necessary to reproduce the problem.

If you need to redact text in a sensitive PDF, you can run it through JoshData/pdf-redactor.

Expected behavior

What did you expect the result should have been?

Actual behavior

What actually happened, instead?

Screenshots

If applicable, add screenshots to help explain your problem.

Environment

pdfplumber version: [e.g., 0.5.22]
Python version: [e.g., 3.8.1]
OS: [e.g., Mac, Linux, etc.]

Additional context

Add any other context/notes about the problem here.

May 30 '24 02:05 zyc1128

pdfplumber pdfplumber copied to clipboard

When I set repair=true,there is an error:'utf-8' codec can't decode byte 0xae in position 239: invalid start byte.Because of the original PDF?

Describe the bug

Have you tried repairing the PDF?

Code to reproduce the problem

PDF file

Expected behavior

Actual behavior

Screenshots

Environment

Additional context

pdfplumber
pdfplumber copied to clipboard