Glyph (a.k.a. `LTChar`) bounding boxes have incorrect height

Open dhdaines opened this issue 8 months ago • 1 comment

Instead of using the font's ascent (or the FontBBox, which is also an option) to calculate the height of glyphs, `pdfminer.layout.LTChar` gives every glyph a height of exactly one unit in text space, which is `fontsize` before the text matrix is applied. This is close to correct for a wide range of fonts, but nothing in PostScript, PDF, TrueType, or any other font format requires or guarantees it.

You can see this most clearly in the PDFs in https://github.com/dhdaines/playa/issues/79

The fix is simple, though it may have unintended consequences for code that relies on the bogus glyph heights in some way (see https://xkcd.com/1172/). Basically:

diff --git a/pdfminer/layout.py b/pdfminer/layout.py
index 2189445..970362e 100644
--- a/pdfminer/layout.py
+++ b/pdfminer/layout.py
@@ -386,7 +386,8 @@ class LTChar(LTComponent, LTText):
         else:
             # horizontal
             descent = font.get_descent() * fontsize
-            bbox = (0, descent + rise, self.adv, descent + rise + fontsize)
+            ascent = font.get_ascent() * fontsize
+            bbox = (0, descent + rise, self.adv, ascent + rise)
         (a, b, c, d, e, f) = self.matrix
         self.upright = a * d * scaling > 0 and b * c <= 0
         (x0, y0, x1, y1) = apply_matrix_rect(self.matrix, bbox)
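
To make the effect of the patch concrete, here is a minimal standalone sketch that computes the horizontal-text bbox both ways. It does not import pdfminer; the helper names and the font metrics (descent of -0.25 and ascent of 0.5 in text space units) are hypothetical values chosen only for illustration, not taken from any real font.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]


def bbox_unit_height(descent: float, rise: float, adv: float, fontsize: float) -> BBox:
    """Current behavior: the top edge is always `fontsize` above the bottom edge."""
    return (0.0, descent + rise, adv, descent + rise + fontsize)


def bbox_ascent_height(descent: float, ascent: float, rise: float, adv: float) -> BBox:
    """Patched behavior: the top edge comes from the font's ascent."""
    return (0.0, descent + rise, adv, ascent + rise)


# Hypothetical metrics, already scaled by the font size as in the diff above.
fontsize = 10.0
descent = -0.25 * fontsize  # assumed value for illustration
ascent = 0.5 * fontsize     # assumed value for illustration
rise = 0.0
adv = 6.0

old = bbox_unit_height(descent, rise, adv, fontsize)
new = bbox_ascent_height(descent, ascent, rise, adv)

old_height = old[3] - old[1]
new_height = new[3] - new[1]
print(f"unit-height bbox:   {old} (height {old_height})")
print(f"ascent-height bbox: {new} (height {new_height})")
```

The two boxes agree only when `ascent - descent` happens to equal one text space unit; for any other font the current code over- or underestimates the glyph height by the difference.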

dhdaines · May 09 '25 20:05

@dhdaines This patch really solved my problem. Thank you!

Testing file: https://www.heavenandearthdesigns.com/freecharts/Freebie%20Galaxy%20Gazing%20Chart%20Pack.pdf

@pietermarsman Would you mind taking a look? I've attached my testing file above.

lihuanglx · Aug 25 '25 10:08