adding text with fontname="Helvetica" can silently fail

Open cbm755 opened this issue 2 years ago • 15 comments

I've been porting to the "new" TextWriter (which is nice BTW!)

Found the following behaviour, when using fontname="Helvetica" (which on reading the docs, doesn't look right but has been working for me for years).

Anyway, consider this MWE:

import fitz
from fitz import Rect

doc = fitz.open()
pg = doc.new_page()

excess = pg.insert_textbox(Rect(0,50,100,150), "hello helvetica", fontname="Helvetica")
print(excess)
assert excess > 0

doc.save("foo.pdf")

Now, that works: excess is 88.17499947547913. But try the following, which first inserts some other text with fontname="helv" (which seems more correct from reading the docs!):

import fitz
from fitz import Font, Point, Rect, TextWriter

doc = fitz.open()
pg = doc.new_page()

# now we "correctly" specify the font here
excess = pg.insert_textbox(Rect(0,0,100,100), "hello helv", fontname="helv")
print(excess)
assert excess > 0

excess = pg.insert_textbox(Rect(0,50,100,150), "hello helvetica", fontname="Helvetica")
print(excess)
assert excess > 0

# alternatively, or as well, we can use the TextWriter
r = Rect(0, 100, 300, 200)
tw = TextWriter(pg.rect)
tw.append(Point(100, 40), "TextWriter says hello", fontsize=18, font=Font("helv"))
pg.write_text(rect=r, writers=tw)
pg.draw_rect(r)

doc.save("foo.pdf")

In this case, excess is printed twice (both calls return a positive value), but the "hello helvetica" string does not show up in foo.pdf.


Part of me thinks "Well garbage in garbage out so RTFM!", but two things are strange:

  1. fontname="Helvetica" is sometimes ok but not other times.
  2. The text seems to be written correctly (positive return value).

cbm755 avatar Sep 08 '22 04:09 cbm755

I guess this is somehow related to this note:

For an existing font of the page, use its reference name as fontname (this is item[4] of its entry in Page.get_fonts()).

b/c pg.get_fonts() changes as follows in the two examples above:

[(5, 'n/a', 'Type1', 'Helvetica', 'Helvetica', 'WinAnsiEncoding')]
[(5, 'n/a', 'Type1', 'Helvetica', 'helv', 'WinAnsiEncoding')]
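
To make the difference easier to see, here is a small sketch that pulls the reference name (item[4]) out of each get_fonts() entry. The helper name reference_names is made up for illustration; the two lists are the ones printed above.

```python
def reference_names(fonts):
    """Return the reference name (item[4]) of each Page.get_fonts() entry."""
    return [entry[4] for entry in fonts]


# Output of pg.get_fonts() for the first and second example, as printed above.
first_example = [(5, 'n/a', 'Type1', 'Helvetica', 'Helvetica', 'WinAnsiEncoding')]
second_example = [(5, 'n/a', 'Type1', 'Helvetica', 'helv', 'WinAnsiEncoding')]

print(reference_names(first_example))   # ['Helvetica']
print(reference_names(second_example))  # ['helv']
```

In the second case only "helv" is registered on the page, so text drawn with the name "Helvetica" has nothing to refer to.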

cbm755 avatar Sep 08 '22 05:09 cbm755

> But the "hello helvetica" string is not there in foo.pdf.

How did you determine that? Looking at the page with some viewer, or is it not being extracted?

I reproduced your case and found the following:

>>> r = fitz.Rect(100, 100, 300, 200)
>>> page.insert_textbox(r, "hello helv", fontname="helv", color=(1, 0, 0))
88.17499947547913
>>> page.insert_textbox(r, "hello Helvetica", fontname="Helvetica", color=(0, 1, 0))
88.17499947547913
>>> doc.save("x.pdf")

Text extraction (get_text()) correctly delivers both strings:

hello helv
hello Helvetica

But the viewers are getting confused in various ways:

  • Adobe Acrobat shows it correctly: (image)
  • so do the (unknown) PDF plugins of the two browsers I am using, Firefox and MS Edge
  • so do Nitro PDF and Foxit Reader
  • SumatraPDF shows both texts, but the "Helvetica" one in Times-Roman. Same thing for the MuPDF viewer, of course, because SumatraPDF is built with MuPDF.
  • PDF-XChange viewer only shows the "helv" text, just like evince on Ubuntu (which you may have been using too).

The two content streams of the PDF page look like this:

>>> print(doc.xref_stream(6))
b'\nq\nBT\n1 0 0 1 100 730.175 Tm /helv 11 Tf 1 0 0 RG 1 0 0 rg [<68656c6c6f2068656c76>]TJ\nET\nQ\n'
>>> print(doc.xref_stream(7))
b'\nq\nBT\n1 0 0 1 100 730.175 Tm /Helvetica 11 Tf 0 1 0 RG 0 1 0 rg [<68656c6c6f2048656c766574696361>]TJ\nET\nQ\n'

Both streams are correct. The problem seems to be my logic for looking up the page's existing fonts, which is meant to avoid inserting the same font multiple times. Because strings like "helv" and "Helvetica" (in any mix of upper and lower case) all lead to the same built-in font "Helvetica", I don't insert a font for the second textbox insertion. But I also fail to insert a second reference to that font when the reference name differs ("Helvetica").
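
A rough way to spot this kind of dangling reference is to collect the font names selected by Tf operators in the content streams and compare them with the reference names registered on the page. The sketch below uses the two streams printed above; fonts_used is a hypothetical helper, and the registered set assumes the page lists only "helv" (as in the second get_fonts() output quoted earlier in the thread).

```python
import re


def fonts_used(stream: bytes) -> set:
    """Collect font reference names selected by Tf operators in a content stream."""
    return {m.decode() for m in re.findall(rb"/(\S+)\s+[\d.]+\s+Tf", stream)}


# The two content streams printed above.
stream6 = b'\nq\nBT\n1 0 0 1 100 730.175 Tm /helv 11 Tf 1 0 0 RG 1 0 0 rg [<68656c6c6f2068656c76>]TJ\nET\nQ\n'
stream7 = b'\nq\nBT\n1 0 0 1 100 730.175 Tm /Helvetica 11 Tf 0 1 0 RG 0 1 0 rg [<68656c6c6f2048656c766574696361>]TJ\nET\nQ\n'

used = fonts_used(stream6) | fonts_used(stream7)
registered = {"helv"}  # assumption: item[4] values of Page.get_fonts() in the failing case
missing = used - registered
print(missing)  # {'Helvetica'}
```

The regular expression only handles simple streams like these; it is not a general PDF content stream parser.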

JorjMcKie avatar Sep 08 '22 09:09 JorjMcKie

BTW, the TextWriter class does not insert the same font as insert_text* does. TextWriter always uses embedded font files, following the general recommendation not to relinquish control over appearance to the PDF viewer. The "helv" font code used in TextWriter leads to a font file that looks exactly like Helvetica but is not the same, and you will also see a larger PDF because of the embedded file. This implies that using the bold, italic, and bold-italic versions may lead to up to 4 embedded files (not too large ones, though). Also, the reference names (which are chosen by the underlying MuPDF code) always look like Fnnn and never helv.

JorjMcKie avatar Sep 08 '22 09:09 JorjMcKie

Suspicion confirmed: I was too stingy and treated a font as a duplicate as soon as the basefont name was equal, whereas the font reference name must match exactly. This will be corrected in the next version.
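
As a toy model of the difference (not the actual code in fitz.py), the page's font resources can be pictured as a dict mapping reference name to basefont. Both function names below are invented for this sketch.

```python
def insert_font_buggy(resources: dict, refname: str, basefont: str) -> None:
    """Skip the insertion whenever the basefont is already on the page."""
    if basefont in resources.values():
        return  # treated as a duplicate even though refname may be new
    resources[refname] = basefont


def insert_font_fixed(resources: dict, refname: str, basefont: str) -> None:
    """Skip the insertion only when this exact reference name is already present."""
    if refname in resources:
        return
    resources[refname] = basefont


buggy, fixed = {}, {}
for refname in ("helv", "Helvetica"):
    # Both names resolve to the same built-in basefont "Helvetica".
    insert_font_buggy(buggy, refname, "Helvetica")
    insert_font_fixed(fixed, refname, "Helvetica")

print(buggy)  # {'helv': 'Helvetica'}
print(fixed)  # {'helv': 'Helvetica', 'Helvetica': 'Helvetica'}
```

With the buggy check, the second insertion writes text that selects /Helvetica, but no such reference name exists in the resources.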

Thanks for spotting this!

JorjMcKie avatar Sep 08 '22 10:09 JorjMcKie

> How did you determine that? Looking at the page with some viewer

Correct, I was using xpdf and Evince (I'd forgotten those are the same renderer) on Fedora, and only looking visually at the results.

cbm755 avatar Sep 08 '22 23:09 cbm755

I think I followed all that. One nagging doubt: suppose I use TextWriter first and then fontname="Helvetica". If TextWriter is embedding its own font, should insert_textbox really be deduping over that as well?

In particular, the following MWE shows no hello helvetica in Evince:

import fitz
from fitz import Font, Point, Rect, TextWriter

doc = fitz.open()
pg = doc.new_page()

r = Rect(0, 100, 300, 200)
tw = TextWriter(pg.rect)
tw.append(Point(100, 40), "TextWriter says helv", fontsize=18, font=Font("helv"))
pg.write_text(rect=r, writers=tw)
pg.draw_rect(r)

excess = pg.insert_textbox(Rect(0,50,100,150), "hello helvetica", fontname="Helvetica")
print(excess)
assert excess > 0

print(pg.get_fonts())

doc.save("foo.pdf")

cbm755 avatar Sep 08 '22 23:09 cbm755

Hm, this is what my evince shows: (image) ... as expected and as it should.

TW and insert_textbox do not interfere at all: the fonts they use are completely different. You are just obscuring the situation by the way you are using TW: you call page.write_text instead of TextWriter.write_text. This creates an additional internal PDF page, which is then embedded in yours. The font used by that internal PDF page is displayed if you look at page.get_fonts(True):

88.17499947547913
[(16, 'n/a', 'Type1', 'Helvetica', 'Helvetica', 'WinAnsiEncoding', 0),
 (6, 'cid', 'Type0', 'Helvetica', 'F0', 'Identity-H', 12)]

The True option causes a recursive search down to each embedded XObject. The cid font is that of the TextWriter, which references it via name "F0". The internal PDF page is stored in XObject at xref 12.
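
For illustration, the entries of such a recursive listing can be split by their last item. This sketch assumes, in line with the explanation above, that the last item is 0 for fonts referenced by the page itself and otherwise the xref of the referencing XObject; split_by_referencer is a hypothetical helper.

```python
def split_by_referencer(fonts):
    """Separate fonts referenced by the page itself from those referenced via an XObject."""
    direct = [f for f in fonts if f[-1] == 0]
    nested = [f for f in fonts if f[-1] != 0]
    return direct, nested


# The page.get_fonts(True) output shown above.
fonts = [
    (16, 'n/a', 'Type1', 'Helvetica', 'Helvetica', 'WinAnsiEncoding', 0),
    (6, 'cid', 'Type0', 'Helvetica', 'F0', 'Identity-H', 12),
]

direct, nested = split_by_referencer(fonts)
print([f[4] for f in direct])           # ['Helvetica']
print([(f[4], f[-1]) for f in nested])  # [('F0', 12)]
```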

JorjMcKie avatar Sep 09 '22 05:09 JorjMcKie

That is not what I see:

image

88.17499947547913
[(6, 'cid', 'Type0', 'Helvetica', 'F0', 'Identity-H')]

file: foo.pdf

cbm755 avatar Sep 09 '22 20:09 cbm755

In [3]: fitz.version
Out[3]: ('1.20.2', '1.20.3', '20220813000001')

cbm755 avatar Sep 09 '22 20:09 cbm755

Interestingly, in xpdf I see this:

image

cbm755 avatar Sep 09 '22 20:09 cbm755

Real old school, in gv, I get:

image

cbm755 avatar Sep 09 '22 20:09 cbm755

This is the same error that I spotted. If you want, you can do a quick fix in your PyMuPDF installation. Search for: (image) and remove or comment out the yellow lines.

JorjMcKie avatar Sep 09 '22 20:09 JorjMcKie

This is in file fitz.py

JorjMcKie avatar Sep 09 '22 20:09 JorjMcKie

Oh, ok so your installation is from main branch or somewhere that has that fix?

I don't need a quick fix, just trying to help.

cbm755 avatar Sep 09 '22 20:09 cbm755

This is the fix that will be published with the next version. Those two lines are causing the error you were referring to.

JorjMcKie avatar Sep 09 '22 20:09 JorjMcKie

Fixed in 1.21.0
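
A script that depends on the fix could check the installed version first. This is only a sketch: has_font_fix is an invented name, it assumes the first item of fitz.version (as printed earlier in the thread) is the PyMuPDF version, and it only handles plain dotted numeric versions such as "1.20.2".

```python
def has_font_fix(version_bind: str) -> bool:
    """Return True if a dotted numeric version string is at least 1.21.0."""
    parts = tuple(int(p) for p in version_bind.split(".")[:3])
    return parts >= (1, 21, 0)


print(has_font_fix("1.20.2"))  # False
print(has_font_fix("1.21.0"))  # True
```

In real use, the argument would be something like fitz.version[0]; version strings with non-numeric suffixes would need extra handling.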