`TypeError` raised by `extract_text` method with compressed PDF file
Bug report
Description
I'm generating PDF documents with WeasyPrint. Since version 59.0 of this package, I can no longer extract text from the generated compressed PDF files with the pdfminer.high_level.extract_text function. It raises a TypeError (invalid length). The exception comes from a utility function called nunpack.
I first opened an issue on the WeasyPrint repository, but it appears the source of the problem could be pdfminer itself.
You can take a look at the answer from the WeasyPrint maintainer to understand how pdfminer is involved in this problem.
Steps to reproduce
from io import BytesIO
from pdfminer.high_level import extract_text
from weasyprint import HTML
html = HTML(string='<h1>Hello world</h1>')
document = html.write_pdf()
extract_text(BytesIO(document)) # 💥 TypeError: invalid length: 6
Here’s a simple and uncompressed PDF to reproduce the problem, in case you’d like to avoid installing another tool 😄: hello.pdf
The error is caused by the cross-reference (XRef) stream with /W [1 4 6]. The third field is encoded using 6 bytes, and it’s decoded here using nunpack, which isn’t designed to handle all integer sizes.
Instead of using struct.unpack in nunpack, it may be useful to use int.from_bytes, which works for integers of any byte length.
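To make the suggestion concrete, here is a minimal sketch of decoding a /W [1 4 6] entry with int.from_bytes. The names unpack_be_int and split_xref_entry are hypothetical and are not part of pdfminer; the sketch assumes fields are big-endian unsigned integers and that an empty (zero-width) field falls back to a caller-supplied default.

```python
from typing import List, Sequence


def unpack_be_int(data: bytes, default: int = 0) -> int:
    """Decode a big-endian unsigned integer of any byte length.

    Hypothetical stand-in for nunpack: an empty field yields `default`.
    """
    if not data:
        return default
    return int.from_bytes(data, byteorder="big", signed=False)


def split_xref_entry(entry: bytes, widths: Sequence[int]) -> List[int]:
    """Split one cross-reference stream entry into integers, following /W."""
    if len(entry) != sum(widths):
        raise ValueError("entry length does not match the /W widths")
    fields = []
    offset = 0
    for width in widths:
        # A zero-width field produces an empty slice, hence the default.
        fields.append(unpack_be_int(entry[offset:offset + width]))
        offset += width
    return fields


# An 11-byte entry laid out as /W [1 4 6]:
# first field 1, second field 0x1234, third field 0 stored on 6 bytes.
entry = b"\x01" + b"\x00\x00\x12\x34" + b"\x00" * 6
print(split_xref_entry(entry, [1, 4, 6]))  # [1, 4660, 0]
```

Because int.from_bytes accepts any length, the 6-byte third field needs no special case, whereas struct format codes only cover fixed sizes such as 1, 2, 4 and 8 bytes.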
Fixed in #1029 (and thank you for WeasyPrint, it is very nice software!)