Glyph (a.k.a. `LTChar`) bounding boxes have incorrect height
Instead of using the font's ascent (or the FontBBox, which is also an option) to calculate the height of glyphs, pdfminer.layout.LTChar just uses a single unit in text space: the top of the box is always placed one font size above the descent. This is almost right for a wide range of fonts, but it is in no way required or guaranteed by PostScript, PDF, TrueType, etc.
You can see this most clearly in the PDFs in https://github.com/dhdaines/playa/issues/79
The fix is really simple, but it may have unintended consequences for code that relied on the bogus glyph heights in some way (see https://xkcd.com/1172/). Basically:
diff --git a/pdfminer/layout.py b/pdfminer/layout.py
index 2189445..970362e 100644
--- a/pdfminer/layout.py
+++ b/pdfminer/layout.py
@@ -386,7 +386,8 @@ class LTChar(LTComponent, LTText):
         else:
             # horizontal
             descent = font.get_descent() * fontsize
-            bbox = (0, descent + rise, self.adv, descent + rise + fontsize)
+            ascent = font.get_ascent() * fontsize
+            bbox = (0, descent + rise, self.adv, ascent + rise)
         (a, b, c, d, e, f) = self.matrix
         self.upright = a * d * scaling > 0 and b * c <= 0
         (x0, y0, x1, y1) = apply_matrix_rect(self.matrix, bbox)
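To make the effect of the patch concrete, here is a minimal standalone sketch of the two bounding-box calculations, before the text matrix is applied. The helper names (`old_bbox`, `new_bbox`, `height`) are hypothetical and not part of pdfminer. It assumes that ascent and descent are expressed as fractions of the font size, which is what the multiplication by `fontsize` in the patch implies.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]


def old_bbox(descent: float, fontsize: float, rise: float, adv: float) -> BBox:
    """Current behaviour: the top edge is always one font size above the descent."""
    d = descent * fontsize
    return (0, d + rise, adv, d + rise + fontsize)


def new_bbox(
    ascent: float, descent: float, fontsize: float, rise: float, adv: float
) -> BBox:
    """Patched behaviour: the top edge comes from the font's ascent."""
    a = ascent * fontsize
    d = descent * fontsize
    return (0, d + rise, adv, a + rise)


def height(bbox: BBox) -> float:
    """Height of an (x0, y0, x1, y1) box."""
    return bbox[3] - bbox[1]


# Illustrative metrics only (not taken from any real font).
# When ascent - descent == 1, both calculations agree:
same = (old_bbox(-0.25, 8, 0, 4), new_bbox(0.75, -0.25, 8, 0, 4))

# When ascent - descent != 1, the old box is too short (or too tall):
tall_old = old_bbox(-0.5, 8, 0, 4)
tall_new = new_bbox(1.0, -0.5, 8, 0, 4)
print(height(tall_old), height(tall_new))
```

The two calculations only agree when `ascent - descent` happens to equal 1, which is why the existing code looks almost right for many fonts and is visibly wrong for others.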
@dhdaines This patch really solved my problem. Thank you!
Test file: https://www.heavenandearthdesigns.com/freecharts/Freebie%20Galaxy%20Gazing%20Chart%20Pack.pdf
@pietermarsman Would you mind taking a look? I've linked my test file above.