pdfminer.six

`TypeError` raised by `extract_text` method with compressed PDF file

Open jbpenrath opened this issue 2 years ago • 2 comments

Bug report

Description

I generate PDF documents with WeasyPrint. Since version 59.0 of that package, I can no longer extract text from the generated compressed PDF files with the pdfminer.high_level.extract_text function: it raises a TypeError (invalid length). The exception is raised from a utility function called nunpack.

I first opened an issue on the WeasyPrint repository, but it appears the problem may come from pdfminer itself.

You can read the WeasyPrint maintainer's answer to understand how pdfminer is involved in this problem.

Steps to reproduce

from io import BytesIO
from pdfminer.high_level import extract_text
from weasyprint import HTML

html = HTML(string='<h1>Hello world</h1>')
document = html.write_pdf()
extract_text(BytesIO(document)) # 💥 TypeError: invalid length: 6

jbpenrath, May 22 '23 14:05

Here’s a simple and uncompressed PDF to reproduce the problem, in case you’d like to avoid installing another tool 😄: hello.pdf

The error is caused by the XRef stream with /W [1 4 6]. The third field is encoded using 6 bytes, and it’s decoded using nunpack, which is not designed to handle all integer sizes.
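To illustrate why a 6-byte field is awkward for struct, here is a minimal sketch of how one entry could be split according to /W. This is not pdfminer's actual code: the helper split_xref_entry and the sample entry bytes are hypothetical, made up for the example.

```python
import struct


def split_xref_entry(entry, widths):
    """Split one raw XRef stream entry into fields using the /W widths.

    Hypothetical helper for illustration only, not part of pdfminer.
    """
    fields = []
    pos = 0
    for width in widths:
        fields.append(entry[pos:pos + width])
        pos += width
    return fields


# Made-up 11-byte entry matching /W [1 4 6]: type 1, offset 17, generation 0.
entry = bytes([1]) + (17).to_bytes(4, "big") + (0).to_bytes(6, "big")
type_field, offset_field, gen_field = split_xref_entry(entry, [1, 4, 6])

# struct only has integer codes for 1, 2, 4 and 8 bytes,
# so the 4-byte field decodes without trouble.
offset = struct.unpack(">L", offset_field)[0]

# There is no 6-byte code; using the 8-byte code on 6 bytes fails.
try:
    struct.unpack(">Q", gen_field)
    six_byte_ok = True
except struct.error:
    six_byte_ok = False
```

A decoder built on fixed struct format codes therefore has to special-case or reject widths such as 3, 5, 6 or 7 bytes.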

Instead of using struct.unpack in nunpack, it may be useful to use int.from_bytes, which works for any integer size.
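A minimal sketch of that suggestion follows. It assumes nunpack receives the raw bytes of one field plus a default value for an empty field, and that fields are big-endian unsigned integers; the signature is an assumption for illustration and this is not necessarily the code that was eventually merged.

```python
def nunpack(data: bytes, default: int = 0) -> int:
    """Decode a big-endian unsigned integer of any byte length.

    Sketch only: the signature and the default handling are assumptions.
    """
    if not data:
        # A zero-width field carries no bytes, so fall back to the default.
        return default
    return int.from_bytes(data, byteorder="big", signed=False)


# A 6-byte field, as produced by /W [1 4 6], now decodes like any other width.
six_byte_value = nunpack(b"\x00\x00\x00\x00\x01\x00")  # 256
empty_value = nunpack(b"", default=0)                  # 0
```

Because int.from_bytes accepts a byte string of any length, no per-width branching is needed.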

liZe, May 23 '23 20:05

Fixed in #1029 (and thank you for WeasyPrint, it is very nice software!)

dhdaines, Aug 01 '24 13:08