PyMuPDF
subset_fonts error exit without exception/warning
Description of the bug
In the new PyMuPDF 1.24.3, if any error occurs in doc.subset_fonts(), the process ends without any warning or error number. In PyMuPDF 1.23.26, the same doc.subset_fonts() call raises an error.
How to reproduce the bug
In PyMuPDF 1.23.26:
Traceback (most recent call last):
  File "C:_a\PDF_Searchable_v1.py", line 346, in pdfSearhable4
    doc.subset_fonts()
  File "C:\Users\6\AppData\Local\Programs\Python\Python310\lib\site-packages\fitz\utils.py", line 5631, in subset_fonts
    width_table, def_width = get_old_widths(font_xref)
  File "C:\Users\6\AppData\Local\Programs\Python\Python310\lib\site-packages\fitz\utils.py", line 5350, in get_old_widths
    df_xref = int(df[1][1:-1].replace("0 R", ""))
ValueError: invalid literal for int() with base 10: '<</BaseFont/CIDFont+F1/CIDSystemInfo<</Ordering 97 /Registry 98 /Supplement 0>>/CIDToGIDMap/Identity/FontDescriptor<</Ascent 952/CapHeight 631/Descent -268/Flags 6/FontBBox 99 /FontFile2 100 /FontNam
PyMuPDF version
1.24.3
Operating system
Windows
Python version
3.10
This post cannot be accepted without a reproducing file.
To work around an urgent situation, please use the argument fallback=True.
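A minimal sketch of that workaround, assuming `doc` is an already opened PyMuPDF document. The helper name `subset_fonts_safely` is hypothetical, and catching `ValueError` is based only on the traceback shown in this thread; other failure modes are not covered.

```python
def subset_fonts_safely(doc, use_fallback=True):
    """Call doc.subset_fonts() and report a failure instead of crashing.

    `doc` is an open PyMuPDF Document. This helper is hypothetical and
    not part of PyMuPDF.
    """
    try:
        # fallback=True is the argument suggested in this thread
        doc.subset_fonts(fallback=use_fallback)
    except ValueError as exc:
        # ValueError is the exception type seen in the traceback above
        print(f"subset_fonts failed: {exc}")
        return False
    return True


# Possible usage (the file names are placeholders):
#   import fitz
#   doc = fitz.open("input.pdf")
#   if subset_fonts_safely(doc):
#       doc.save("output.pdf", garbage=3, deflate=True)
```

Returning a boolean lets the caller decide whether to save the document at all when subsetting fails.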
Running doc.subset_fonts() on the attached file (1 - Copy.pdf) raises an error in an earlier version.
With fallback=True, doc.subset_fonts() raises the same error.
In the new version (without fallback), the error is not raised, but the file written by doc.save() after doc.subset_fonts() has scrambled words.
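One rough way to catch this silent failure is to compare the extracted text before and after subsetting. The sketch below is only an illustration: both function names are hypothetical, and it detects changes in extracted text only, so a file whose rendering is scrambled while its extracted text stays the same would go unnoticed.

```python
def changed_pages(before, after):
    """Return 0-based indices of pages whose extracted text differs."""
    if len(before) != len(after):
        raise ValueError("page counts differ")
    return [i for i, (b, a) in enumerate(zip(before, after)) if b != a]


def text_changes_after_subsetting(path):
    """Subset fonts in memory and list the pages whose text changed."""
    import fitz  # PyMuPDF; imported here so changed_pages() works without it

    with fitz.open(path) as doc:
        before = [page.get_text() for page in doc]
        doc.subset_fonts()
        # serialize to memory instead of overwriting the original file
        data = doc.tobytes(garbage=3, deflate=True)

    with fitz.open(stream=data, filetype="pdf") as saved:
        after = [page.get_text() for page in saved]

    return changed_pages(before, after)
```

An empty list from `text_changes_after_subsetting("input.pdf")` (placeholder file name) would suggest the text survived subsetting; any listed page is worth inspecting visually.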
I can reproduce the previous comment:
In [2]: fitz.version
Out[2]: ('1.23.3', '1.23.2', '20230831000001')
In [3]: d = fitz.open("1.-.Copy.pdf")
In [4]: d.subset_fonts()
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[4], line 1
----> 1 d.subset_fonts()
File /usr/lib64/python3.12/site-packages/fitz/utils.py:5448, in subset_fonts(doc, verbose)
5445 # walk through the original font xrefs and replace each by the subset def
5446 for font_xref in xref_set:
5447 # we need the original '/W' and '/DW' width values
-> 5448 width_table, def_width = get_old_widths(font_xref)
5449 # ... and replace original font definition at xref with it
5450 doc.update_object(font_xref, font_str)
File /usr/lib64/python3.12/site-packages/fitz/utils.py:5175, in subset_fonts.<locals>.get_old_widths(xref)
5173 if df[0] != "array": # only handle xref specifications
5174 return None, None
-> 5175 df_xref = int(df[1][1:-1].replace("0 R", ""))
5176 widths = doc.xref_get_key(df_xref, "W")
5177 if widths[0] != "array": # no widths key found
ValueError: invalid literal for int() with base 10: '<</BaseFont/CIDFont+F1/CIDSystemInfo<</Ordering 13 /Registry 14 /Supplement 0>>/CIDToGIDMap/Identity/FontDescriptor<</Ascent 952/CapHeight 631/Descent -268/Flags 6/FontBBox 15 /FontFile2 16 /FontName
But with 1.24.3, I get no error and upon save I see scrambled words.
The MuPDF team has developed a fix which I am currently testing.
Update: fix developed.
I have a possibly-related issue where 1.24.3 leaves some misc chars on the page, which go away if I stop using subset_fonts. Haven't narrowed it down to a MWE yet, but one difference is I DO NOT get an error with older pymupdf: so it might not be quite the same issue... More to follow.
Downstream issue: https://gitlab.com/plom/plom/-/issues/3374
Fixed in 1.24.6.
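For code that must run against several PyMuPDF versions, a version check along these lines could decide whether to call subset_fonts() at all. This is a sketch under assumptions: the helper names are hypothetical, and the version string is taken from the first element of `fitz.version`, as in the session shown earlier in this thread.

```python
import re

# first release reported in this thread as containing the fix
FIXED_IN = (1, 24, 6)


def version_tuple(version):
    """Turn a string such as '1.24.3' into a tuple such as (1, 24, 3)."""
    parts = re.findall(r"\d+", version)[:3]
    return tuple(int(p) for p in parts)


def has_subset_fonts_fix(version):
    """True if the given PyMuPDF version is 1.24.6 or later."""
    return version_tuple(version) >= FIXED_IN


# Possible usage:
#   import fitz
#   if has_subset_fonts_fix(fitz.version[0]):
#       doc.subset_fonts()
```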