PyMuPDF
adding text with fontname="Helvetica" can silently fail
I've been porting to the "new" TextWriter (which is nice BTW!)
Found the following behaviour, when using fontname="Helvetica"
(which on reading the docs, doesn't look right but has been working for me for years).
Anyway, consider this MWE:
from fitz import *
doc = fitz.open()
pg = doc.new_page()
excess = pg.insert_textbox(Rect(0,50,100,150), "hello helvetica", fontname="Helvetica")
print(excess)
assert excess > 0
doc.save("foo.pdf")
Now, that works: excess is 88.17499947547913. But try the following, with some other text using fontname="helv" (which seems more correct from reading the docs!):
from fitz import *
doc = fitz.open()
pg = doc.new_page()
# now we "correctly" specify the font here
excess = pg.insert_textbox(Rect(0,0,100,100), "hello helv", fontname="helv")
print(excess)
assert excess > 0
excess = pg.insert_textbox(Rect(0,50,100,150), "hello helvetica", fontname="Helvetica")
print(excess)
assert excess > 0
# alternatively, or as well, we can use the TextWriter
r = Rect(0, 100, 300, 200)
tw = TextWriter(pg.rect)
tw.append(Point(100, 40), "TextWriter says hello", fontsize=18, font=Font("helv"))
pg.write_text(rect=r, writers=tw)
pg.draw_rect(r)
doc.save("foo.pdf")
In this case, excess prints twice. But the "hello helvetica" string is not there in foo.pdf.
Part of me thinks "Well garbage in garbage out so RTFM!", but two things are strange:
- fontname="Helvetica" is sometimes ok but not other times.
- The text seems to be written correctly (positive return value).
I guess this is somehow related to this note:
For an existing font of the page, use its reference name as fontname (this is item[4] of its entry in Page.get_fonts()).
because pg.get_fonts() changes as follows in the two examples above:
[(5, 'n/a', 'Type1', 'Helvetica', 'Helvetica', 'WinAnsiEncoding')]
[(5, 'n/a', 'Type1', 'Helvetica', 'helv', 'WinAnsiEncoding')]
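To make the difference concrete, the two listings can be compared field by field. This is a minimal sketch in plain Python that only reuses the tuples printed above. It assumes that item[3] is the basefont name and, per the quoted note, that item[4] is the reference name; the helper names are made up for illustration.

```python
# Font entries copied from the two pg.get_fonts() outputs above.
fonts_first_example = [(5, 'n/a', 'Type1', 'Helvetica', 'Helvetica', 'WinAnsiEncoding')]
fonts_second_example = [(5, 'n/a', 'Type1', 'Helvetica', 'helv', 'WinAnsiEncoding')]


def base_names(fonts):
    """Return the set of basefont names (assumed to be item[3])."""
    return {item[3] for item in fonts}


def ref_names(fonts):
    """Return the set of reference names (item[4], per the quoted note)."""
    return {item[4] for item in fonts}


# Same basefont in both examples ...
print(base_names(fonts_first_example) == base_names(fonts_second_example))  # True
# ... but the second page has no reference named "Helvetica",
# even though text was inserted with fontname="Helvetica".
print("Helvetica" in ref_names(fonts_second_example))  # False
```

So in the second example the page only knows the font under the name "helv", which fits the observation that the "hello helvetica" text does not show up.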
> But the "hello helvetica" string is not there in foo.pdf.
How did you determine that? By looking at the page with some viewer, or is it not being extracted?
I reproduced your case and found the following:
r=fitz.Rect(100,100,300,200)
page.insert_textbox(r,"hello helv", fontname="helv",color=(1,0,0))
88.17499947547913
page.insert_textbox(r,"hello Helvetica", fontname="Helvetica",color=(0,1,0))
88.17499947547913
doc.save("x.pdf")
Text extraction (get_text()) correctly delivers:
hello helv
hello Helvetica
But the viewers are getting confused in various ways:
- Adobe Acrobat displays the page correctly
- so do the (unknown) PDF plugins of the two browsers I am using, Firefox and MS Edge
- so do Nitro PDF and Foxit Reader
- SumatraPDF shows both texts, but the "Helvetica" one in Times-Roman. Same thing for the MuPDF viewer, of course, because SumatraPDF is built with MuPDF.
- PDF-XChange viewer only shows the "helv" text, just like evince on Ubuntu (which you may have been using too).
The two content streams of the PDF page look like this:
print(doc.xref_stream(6))
b'\nq\nBT\n1 0 0 1 100 730.175 Tm /helv 11 Tf 1 0 0 RG 1 0 0 rg [<68656c6c6f2068656c76>]TJ\nET\nQ\n'
print(doc.xref_stream(7))
b'\nq\nBT\n1 0 0 1 100 730.175 Tm /Helvetica 11 Tf 0 1 0 RG 0 1 0 rg [<68656c6c6f2048656c766574696361>]TJ\nET\nQ\n'
Both are correct. The problem seems to be my logic that looks up the existing fonts of the page to avoid inserting the same font multiple times. Because strings like "helv" and "Helvetica" (in any mix of upper and lower case) all lead to the same built-in font "Helvetica", I don't insert a font for the second textbox insertion. But I also fail to insert a second reference to that font when the reference name is different ("Helvetica").
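For readers who want to check the two content streams by hand, here is a small plain-Python sketch that pulls out the font reference name given to the Tf operator and decodes the hex string shown by TJ. It is not a general PDF parser: it assumes exactly the simple shape of the two streams quoted above (one /name size Tf and one hex string each), and the summarize helper is made up for illustration.

```python
import re

# The two content streams, copied from the doc.xref_stream() output above.
streams = [
    b'\nq\nBT\n1 0 0 1 100 730.175 Tm /helv 11 Tf 1 0 0 RG 1 0 0 rg [<68656c6c6f2068656c76>]TJ\nET\nQ\n',
    b'\nq\nBT\n1 0 0 1 100 730.175 Tm /Helvetica 11 Tf 0 1 0 RG 0 1 0 rg [<68656c6c6f2048656c766574696361>]TJ\nET\nQ\n',
]


def summarize(stream):
    """Return (font reference name, decoded text) for a one-string stream."""
    # "/helv 11 Tf" selects the font resource named "helv" at size 11.
    font = re.search(rb"/(\w+) [\d.]+ Tf", stream).group(1).decode("ascii")
    # "<...>" is a hex-encoded string; plain ASCII here, so latin-1 is safe.
    hex_text = re.search(rb"<([0-9A-Fa-f]+)>", stream).group(1).decode("ascii")
    return font, bytes.fromhex(hex_text).decode("latin-1")


for s in streams:
    print(summarize(s))
# ('helv', 'hello helv')
# ('Helvetica', 'hello Helvetica')
```

The second stream asks for a font resource named "Helvetica". If the page's resources only define "helv", a viewer has nothing to resolve that name against, which would explain why viewers disagree on what to show.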
BTW the TextWriter class does not insert the same font as insert_text* does. TextWriter always uses embedded font files, following the general recommendation not to relinquish control over appearance to the PDF viewer.
The "helv" font code used in TextWriter leads to using a font file that looks exactly like Helvetica - but it is not the same, and you will also see a larger PDF size because of the embedded file. This implies that using the bold, italic, bold-italic versions may lead to up to 4 embedded files - not too large ones, though.
Also, the reference names (which are chosen by the underlying MuPDF code) always look like Fnnn and never helv.
Suspicions confirmed: I was too stingy and treated a font as a duplicate as soon as the basefont name was equal, whereas the font reference name must match exactly. Will be corrected in the next version.
Thanks for spotting this!
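The duplicate check described above can be illustrated with a toy model. This is only a sketch of the idea, not the actual code in fitz.py: the dictionary of page resources and both function names are hypothetical.

```python
# Hypothetical model of a page's font resources:
# reference name -> basefont name.


def is_duplicate_buggy(resources, refname, basefont):
    """Too stingy: any entry with the same basefont counts as a duplicate."""
    return basefont in resources.values()


def is_duplicate_fixed(resources, refname, basefont):
    """Duplicate only if this exact reference name already maps to the font."""
    return resources.get(refname) == basefont


# State after the first call, insert_textbox(..., fontname="helv").
resources = {"helv": "Helvetica"}

# Second call uses fontname="Helvetica" for the same built-in font.
print(is_duplicate_buggy(resources, "Helvetica", "Helvetica"))  # True
print(is_duplicate_fixed(resources, "Helvetica", "Helvetica"))  # False
```

With the first check, the second insertion is skipped entirely and the name "Helvetica" never becomes a resource of the page. With the second check, a new reference is added while "helv" is still recognised as already present.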
> How did you determine that? Looking at the page with some viewer
Correct, I was using xpdf and Evince (I'd forgotten those are the same renderer) on Fedora, and only looking visually at the results.
I think I followed all that. One nagging doubt: suppose I use TextWriter first and then fontname="Helvetica". If TextWriter is embedding its own font, should insert_textbox really be deduping over that as well?
In particular, the following MWE shows no "hello helvetica" in Evince:
from fitz import *
doc = fitz.open()
pg = doc.new_page()
r = Rect(0, 100, 300, 200)
tw = TextWriter(pg.rect)
tw.append(Point(100, 40), "TextWriter says helv", fontsize=18, font=Font("helv"))
pg.write_text(rect=r, writers=tw)
pg.draw_rect(r)
excess = pg.insert_textbox(Rect(0,50,100,150), "hello helvetica", fontname="Helvetica")
print(excess)
assert excess > 0
print(pg.get_fonts())
doc.save("foo.pdf")
Hm, this is what my evince shows:
[screenshot not preserved]
... as expected and as it should.
TW and insert_textbox do not interfere at all - the fonts they use are completely different.
You are just obscuring the situation by the way you are using TW: you do page.write_text instead of TextWriter.write_text.
This creates an additional internal PDF page, which is then embedded in yours. So there is a font used by that internal PDF page, which you can display by looking at page.get_fonts(True):
88.17499947547913
[(16, 'n/a', 'Type1', 'Helvetica', 'Helvetica', 'WinAnsiEncoding', 0),
(6, 'cid', 'Type0', 'Helvetica', 'F0', 'Identity-H', 12)]
The True option causes a recursive search down to each embedded XObject.
The cid font is that of the TextWriter, which references it via name "F0". The internal PDF page is stored in XObject at xref 12.
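The recursive listing can be split into fonts referenced by the page itself and fonts referenced from an embedded XObject. The sketch below is plain Python over the two tuples printed above. It assumes that the seventh item is the xref of the referencing XObject and that 0 means the page references the font directly, which matches the output shown here; split_by_referencer is a made-up helper name.

```python
# Entries copied from the page.get_fonts(True) output above.
fonts = [
    (16, 'n/a', 'Type1', 'Helvetica', 'Helvetica', 'WinAnsiEncoding', 0),
    (6, 'cid', 'Type0', 'Helvetica', 'F0', 'Identity-H', 12),
]


def split_by_referencer(fonts):
    """Group reference names by where the font is referenced from."""
    direct = []   # referenced by the page itself (seventh item == 0, assumed)
    nested = {}   # XObject xref -> reference names used inside it
    for xref, ext, ftype, basefont, name, encoding, referencer in fonts:
        if referencer == 0:
            direct.append(name)
        else:
            nested.setdefault(referencer, []).append(name)
    return direct, nested


direct, nested = split_by_referencer(fonts)
print(direct)  # ['Helvetica']
print(nested)  # {12: ['F0']}
```

Both entries share the basefont name "Helvetica", yet only the first one belongs to the page's own resources; "F0" lives inside the XObject at xref 12.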
That is not what I see:
88.17499947547913
[(6, 'cid', 'Type0', 'Helvetica', 'F0', 'Identity-H')]
file: foo.pdf
In [3]: fitz.version
Out[3]: ('1.20.2', '1.20.3', '20220813000001')
Interestingly, in xpdf I see this:
[screenshot not preserved]
Real old school, in gv, I get:
[screenshot not preserved]
This is the same error that I spotted.
If you want, you can do a quick fix in your PyMuPDF installation. In file fitz.py, search for:
[screenshot not preserved]
and remove or comment out the yellow lines.
Oh, ok so your installation is from main branch or somewhere that has that fix?
I don't need a quick fix, just trying to help.
This is the fix that will be published with the next version. Those two lines are causing the error you were referring to.
Fixed in 1.21.0