
Contents stream contains floats in scientific notation

Open MerlijnWajer opened this issue 10 months ago • 3 comments

Description of the bug

In more recent versions of PyMuPDF, the contents stream can contain (invalid for PDF?) floating point numbers in scientific notation.

For example, these are generated in 1.24.1 (and 1.23):

$ mutool show /tmp/out-my1.24.1.pdf 13
13 0 obj
<<
  /Length 54
  /Filter /FlateDecode
>>
stream

q
255.36 0 0 328.8 7.62939e-06 0 cm
/fzImg0 Do
Q
endstream
endobj

This is what the same contents sections look like in 1.21.0:

$ mutool show /tmp/out-my1.21.0.pdf 13
13 0 obj
<<
  /Length 58
  /Filter /FlateDecode
>>
stream

q
255.35999 0 0 328.8 .0000076293949 0 cm
/fzImg0 Do
Q
endstream
endobj
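For context: the PDF specification (ISO 32000-1, section 7.3.3) does not allow real numbers in exponential form, so `7.62939e-06` is not a valid operand. The difference between the two outputs above can be sketched in plain Python. The helper name `format_pdf_real` and the choice of 10 decimal places are assumptions made for illustration; this is not MuPDF's actual formatting routine.

```python
def format_pdf_real(value: float, places: int = 10) -> str:
    """Format a float as a PDF real: plain decimal digits, never an exponent.

    Hypothetical helper for illustration only, not MuPDF's implementation.
    """
    # Fixed-point formatting ("f") never produces an exponent.
    text = f"{value:.{places}f}"
    # Drop trailing zeros and a dangling decimal point.
    text = text.rstrip("0").rstrip(".")
    # Tiny negative values would otherwise collapse to "-0".
    return "0" if text in ("-0", "") else text


# Python's default float formatting switches to scientific notation
# for small magnitudes, which is the kind of token seen in 1.24.1:
print(repr(7.62939e-06))             # -> 7.62939e-06
# Fixed-point formatting keeps the token valid for a contents stream:
print(format_pdf_real(7.62939e-06))  # -> 0.0000076294
print(format_pdf_real(255.36))       # -> 255.36
```

The trade-off is precision: fixed-point output with a limited number of places rounds very small values, which is harmless for a translation component this close to zero.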

How to reproduce the bug

Apologies up front for not being able to give a simple Python script to reproduce the issue. The issue is 100% reproducible, but reproducing it requires my archive-pdf-tools (https://github.com/internetarchive/archive-pdf-tools / https://pypi.org/project/archive-pdf-tools/) tooling. I spent a bit of time trying to make a simple proof of concept but gave up and decided to just file the issue first.

I hope the description in this issue is enough to make someone go 'aha!'.

After installing archive-pdf-tools, this command can be used to generate an MRC-compressed PDF (input files here: https://wizzup.org/dirlist/pymupdf/):

recode_pdf -o /tmp/out.pdf -m 2 --bg-downsample 2 --dpi 600 --mask-compression jbig2 --hocr-file image00008.hocr -I image00008.jpg

or to do it without jbig2 installed:

recode_pdf -o /tmp/out.pdf -m 2 --bg-downsample 2 --dpi 600 --mask-compression ccitt --hocr-file image00008.hocr -I image00008.jpg

Once the PDF is created, observe that with 1.24 (or 1.23) it's broken:

$ pdfimages -list /tmp/out.pdf 
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
Syntax Error (18381): Unknown operator 'e-06'
Syntax Error (18392): Too few (1) args to 'cm' operator
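One plausible reading of these two errors: the lexer accepts `7.62939` as a number, stops at `e`, and treats `e-06` as an operator name. The unknown operator discards the operands collected so far, so `cm` is left with only the trailing `0`. The toy interpreter below reproduces that sequence. It is a sketch under that assumption and not Poppler's actual code; the token pattern and the small operator table are made up for this example.

```python
import re
from typing import List

_TOKEN = re.compile(
    rb"(?P<num>[+-]?(?:\d+\.?\d*|\.\d+))"  # plain decimal number, no exponent
    rb"|(?P<name>/\S+)"                     # name object such as /fzImg0
    rb"|(?P<kw>\S+)"                        # anything else is taken as an operator
)

# Operand counts for the few operators that occur in the example stream.
_ARITY = {b"q": 0, b"Q": 0, b"cm": 6, b"Do": 1}


def toy_interpret(stream: bytes) -> List[str]:
    """Return the error messages a naive interpreter would report."""
    errors: List[str] = []
    operands: List[bytes] = []
    for match in _TOKEN.finditer(stream):
        if match.lastgroup != "kw":
            # Numbers and names are pushed as operands.
            operands.append(match.group())
            continue
        op = match.group()
        if op not in _ARITY:
            errors.append(f"Unknown operator '{op.decode()}'")
        elif len(operands) < _ARITY[op]:
            errors.append(
                f"Too few ({len(operands)}) args to '{op.decode()}' operator"
            )
        # Every operator, known or not, consumes the pending operands.
        operands.clear()
    return errors


bad = b"q\n255.36 0 0 328.8 7.62939e-06 0 cm\n/fzImg0 Do\nQ\n"
for message in toy_interpret(bad):
    print(message)
```

Run on the 1.24.1 stream, this prints the same two messages as `pdfimages`; run on the 1.21.0 stream, it reports nothing.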

Or open in mupdf/evince (this will show an empty page).

The PDF will render OK in mupdf/evince when made with 1.21.

Surprisingly, PDF.js (built-in Firefox PDF renderer) renders both OK.

I also added the two generated PDFs here: https://wizzup.org/dirlist/pymupdf/
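To check other PDFs for the same problem without relying on a viewer, the decoded contents streams can be scanned for exponent-style tokens. The following is a heuristic sketch: the function names are made up for this example, and splitting on whitespace can give false positives inside string operands or inline image data. `scan_pdf` assumes PyMuPDF is installed and uses `Page.read_contents()` to get each page's concatenated contents.

```python
import re
from typing import Dict, List

# A whole token that is a number written with an exponent, e.g. 7.62939e-06.
_SCI_TOKEN = re.compile(rb"[+-]?(?:\d+\.?\d*|\.\d+)[eE][+-]?\d+")


def find_scientific_tokens(stream: bytes) -> List[bytes]:
    """Return whitespace-separated tokens that look like scientific notation."""
    return [tok for tok in stream.split() if _SCI_TOKEN.fullmatch(tok)]


def scan_pdf(path: str) -> Dict[int, List[bytes]]:
    """Map page numbers (0-based) to suspicious tokens in that page's contents."""
    import fitz  # PyMuPDF; imported here so the helper above works without it

    hits: Dict[int, List[bytes]] = {}
    with fitz.open(path) as doc:
        for page in doc:
            found = find_scientific_tokens(page.read_contents())
            if found:
                hits[page.number] = found
    return hits


sample = b"q\n255.36 0 0 328.8 7.62939e-06 0 cm\n/fzImg0 Do\nQ\n"
print(find_scientific_tokens(sample))  # -> [b'7.62939e-06']
```

Calling `scan_pdf("/tmp/out.pdf")` on a file produced with an affected version should list the offending operands per page; an empty result means no exponent-style tokens were found.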

PyMuPDF version

1.24.1

Operating system

Linux

Python version

3.11

MerlijnWajer avatar Apr 14 '24 15:04 MerlijnWajer

Hi. Just wanted to chime in and say that I was the one who originally brought this to @MerlijnWajer's attention, and my results are the same as his. When Merlijn mentioned that the Internet Archive (for whom the archive-pdf-tools toolset was created) uses PyMuPDF 1.21.0, I had pip install the toolset in a venv with PyMuPDF pinned to ==1.21.0 instead of the latest version, and the PDF rendered correctly.

PyMuPDF version

1.21.0 (in venv), previously had 1.24.0 installed and had the same issue that @MerlijnWajer reproduced.

Operating System

Linux Mint (Debian->Ubuntu-based)

Python version

3.8.10 (in venv)

TDavLinguist avatar Apr 14 '24 15:04 TDavLinguist

Thanks for reporting this, we're looking into it. A fix will need a new release of MuPDF, so probably will not be in the next PyMuPDF release.

Thanks! Is there a mupdf ticket that we could follow?

MerlijnWajer avatar Apr 16 '24 20:04 MerlijnWajer

Thanks! Is there a mupdf ticket that we could follow?

There isn't a ticket, but MuPDF master branch commit bfacc4e376012b025 earlier today provides the necessary hook to MuPDF's floating-point formatting routine.

This commit will be cherry-picked onto the MuPDF release branch soon, I think, and then used by a future PyMuPDF release.

Fixed in 1.24.3.