PyMuPDF
Contents stream contains floats in scientific notation
Description of the bug
In more recent versions of PyMuPDF, the contents stream can contain (invalid for PDF?) floating point numbers in scientific notation.
For example, these are generated in 1.24.1 (and 1.23):
$ mutool show /tmp/out-my1.24.1.pdf 13
13 0 obj
<<
/Length 54
/Filter /FlateDecode
>>
stream
q
255.36 0 0 328.8 7.62939e-06 0 cm
/fzImg0 Do
Q
endstream
endobj
This is what the same contents sections look like in 1.21.0:
$ mutool show /tmp/out-my1.21.0.pdf 13
13 0 obj
<<
/Length 58
/Filter /FlateDecode
>>
stream
q
255.35999 0 0 328.8 .0000076293949 0 cm
/fzImg0 Do
Q
endstream
endobj
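For illustration only: the PDF specification does not allow exponential notation for real numbers, so a writer has to emit fixed-point decimals like the 1.21.0 output above. The sketch below shows one way to do that in Python. The helper name `pdf_real` and the 10-digit precision are assumptions made for this example; this is not how MuPDF or PyMuPDF formats numbers internally.

```python
def pdf_real(value: float, digits: int = 10) -> str:
    """Format a float as a fixed-point decimal without an exponent.

    `pdf_real` and `digits` are hypothetical names for this sketch.
    """
    # The 'f' presentation type never produces scientific notation.
    text = f"{value:.{digits}f}"
    # Drop trailing zeros and a dangling decimal point: "328.8000" -> "328.8".
    text = text.rstrip("0").rstrip(".")
    # Values that round to zero may come out as "-0"; normalise them.
    if text in ("-0", ""):
        text = "0"
    return text


# The matrix from the 1.24.1 stream, written without exponents.
matrix = [255.36, 0, 0, 328.8, 7.62939e-06, 0]
print(" ".join(pdf_real(v) for v in matrix) + " cm")
```

The trade-off is that very small values are rounded to the chosen number of digits rather than kept at full precision, which is usually harmless for a translation offset like this one.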
How to reproduce the bug
Apologies up front for not being able to give a simple python script to reproduce the issue. The issue is 100% reproducible, but this requires using my archive-pdf-tools (https://github.com/internetarchive/archive-pdf-tools / https://pypi.org/project/archive-pdf-tools/) tooling. I spent a bit of time trying to make a simple proof of concept but gave up and decided to just file the issue first.
I hope the description in this issue is enough to make someone go 'aha!'.
After installing archive-pdf-tools this command can be used to generate a MRC compressed PDF (input files here: https://wizzup.org/dirlist/pymupdf/):
recode_pdf -o /tmp/out.pdf -m 2 --bg-downsample 2 --dpi 600 --mask-compression jbig2 --hocr-file image00008.hocr -I image00008.jpg
or to do it without jbig2 installed:
recode_pdf -o /tmp/out.pdf -m 2 --bg-downsample 2 --dpi 600 --mask-compression ccitt --hocr-file image00008.hocr -I image00008.jpg
Once the PDF is created, observe that with 1.24 (or 1.23) it's broken:
$ pdfimages -list /tmp/out.pdf
page num type width height color comp bpc enc interp object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
Syntax Error (18381): Unknown operator 'e-06'
Syntax Error (18392): Too few (1) args to 'cm' operator
Or open in mupdf/evince (this will show an empty page).
The PDF will render OK in mupdf/evince when made with 1.21.
Surprisingly, PDF.js (built-in Firefox PDF renderer) renders both OK.
I also added the two generated PDFs here: https://wizzup.org/dirlist/pymupdf/
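As a rough way to check other files for the same problem, the sketch below scans a decompressed content stream for numeric tokens written with an exponent, which is what triggers the "Unknown operator 'e-06'" error from pdfimages. The function name `find_exponent_tokens` is made up for this example, and the whitespace-based tokenising is a simplification: it does not parse string literals or inline image data, so it can report false positives on streams that contain them.

```python
import re
from typing import List

# A number followed by an exponent, e.g. 7.62939e-06 or -1E5.
_EXP_NUMBER = re.compile(rb"[+-]?(?:\d+\.?\d*|\.\d+)[eE][+-]?\d+")


def find_exponent_tokens(stream: bytes) -> List[bytes]:
    """Return whitespace-separated tokens that look like exponent-notation numbers.

    Hypothetical helper for this sketch; not part of PyMuPDF or MuPDF.
    """
    return [tok for tok in stream.split() if _EXP_NUMBER.fullmatch(tok)]


# The two content streams shown in the description.
new_stream = b"q\n255.36 0 0 328.8 7.62939e-06 0 cm\n/fzImg0 Do\nQ\n"
old_stream = b"q\n255.35999 0 0 328.8 .0000076293949 0 cm\n/fzImg0 Do\nQ\n"

print(find_exponent_tokens(new_stream))
print(find_exponent_tokens(old_stream))
```

The decompressed bytes for a given object can be obtained with PyMuPDF's `Document.xref_stream(xref)`, or by copying the output of `mutool show` as above.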
PyMuPDF version
1.24.1
Operating system
Linux
Python version
3.11
Hi. Just wanted to chime in and say that I was the one who originally brought this to @MerlijnWajer's attention, and my results are the same as his. When Merlijn mentioned that the Internet Archive (for whom the archive-pdf-tools toolset was created) uses PyMuPDF 1.21.0, I used pip to install the toolset in a venv with PyMuPDF pinned to ==1.21.0 instead of the latest version, and the PDF rendered correctly.
PyMuPDF version
1.21.0 (in venv), previously had 1.24.0 installed and had the same issue that @MerlijnWajer reproduced.
Operating System
Linux Mint (Debian->Ubuntu-based)
Python version
3.8.10 (in venv)
Thanks for reporting this, we're looking into it. A fix will need a new release of MuPDF, so probably will not be in the next PyMuPDF release.
Thanks! Is there a mupdf ticket that we could follow?
There isn't a ticket, but MuPDF master branch commit bfacc4e376012b025 earlier today provides the necessary hook to MuPDF's floating-point formatting routine.
This commit will be cherry-picked onto the MuPDF release branch soon, I think, and then used by a future PyMuPDF release.
Fixed in 1.24.3.