PdfPig
Incorrect bounding box on TimesNewRomanPSMT
Hi. I found an issue with the bounding boxes for the font TimesNewRomanPSMT in this document:
document11.pdf
If you extract the bounding box for the word "difficulty" on the first page,
you will see that the bounding box is shifted. I tested the same case in itextsharp and found that it uses the font descender to calculate the bounding box.
Looking into the PdfPig code, I can't find any usage of the font's descender. Is that correct? Could you advise how to fix this?
Hi @grinay, would you be able to share a screenshot of the itextsharp bounding box?
The bounding box here looks correct to me, but I might be wrong. For words, the bounding boxes start from the baseline points. Maybe compare the bounding box of this word with the bounding boxes of its letters.
Hi @BobLd, I've made a simple test with itextsharp: https://gist.github.com/grinay/ff5bba3c0b9b6a81f11413ca669583ff. It outputs the positions yStart: 552,51666, yEnd: 564,51666. Here is how I originally extracted the positions for letters; based on this information, I later take the max and min of the Y positions.
    foreach (var letter in word.Letters)
    {
        var rectangle = new[]
        {
            (float)letter.StartBaseLine.X - (float)pigPage.MediaBox.Bounds.Left,
            (float)textLine.BoundingBox.Bottom - (float)pigPage.MediaBox.Bounds.Bottom,
            (float)letter.EndBaseLine.X - (float)pigPage.MediaBox.Bounds.Left,
            (float)textLine.BoundingBox.Top - (float)pigPage.MediaBox.Bounds.Bottom,
        };
        ....
    }
I was able to fix it without modifying PdfPig. I wrote a function that uses reflection to go down into the resource store of Page (resourceStore -> loadedFonts) and get the font descriptor. Currently I'm only handling the font types Type1FontSimple and TrueTypeSimpleFont. Once I have access to the font descriptor, I can correct the Y position:
    // Current font
    var fontName = word.Letters.First().FontName;
    var pointSize = word.Letters.First().PointSize;
    var fontDescriptor = fontDescriptors.FirstOrDefault(x => x.FontName == fontName);
    foreach (var letter in word.Letters)
    {
        // If the font has Descent and Ascent, correct the bounding box
        if (fontDescriptor != null)
        {
            // Correct the bounding box with fontDescriptor.Descent
            var descent = fontDescriptor.Descent;
            var ascent = fontDescriptor.Ascent;
            var fontBbox = fontDescriptor.BoundingBox;
            // This logic is taken from itextsharp
            var maxAscent = Math.Max((decimal)fontBbox.TopRight.Y, ascent);
            var minDescent = Math.Min((decimal)fontBbox.BottomRight.Y, descent);
            ascent = maxAscent * 1000 / (maxAscent - minDescent);
            descent = minDescent * 1000 / (maxAscent - minDescent);
            //
            var bboxHeight = descent / 1000 * (decimal)pointSize;
            var bboxHeight2 = ascent / 1000 * (decimal)pointSize;
            rectangle = new[]
            {
                (float)letter.StartBaseLine.X - (float)pigPage.MediaBox.Bounds.Left,
                (float)letter.StartBaseLine.Y + (float)bboxHeight - (float)pigPage.MediaBox.Bounds.Bottom,
                (float)letter.EndBaseLine.X - (float)pigPage.MediaBox.Bounds.Left,
                (float)letter.StartBaseLine.Y + (float)bboxHeight2 - (float)pigPage.MediaBox.Bounds.Bottom,
            };
        }
        ...
    }
After this fix everything looks correct. I've tested on other documents, and it works.
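For readers who want to check the arithmetic in isolation, here is a minimal sketch of the ascent/descent scaling used in the snippet above, with plain doubles instead of PdfPig types. The class `FontMetricsSketch` and the method `VerticalExtent` are hypothetical names made up for this illustration and are not part of PdfPig or itextsharp. The guard for a non-positive height is also an assumption that the original snippet does not have.

```csharp
using System;

// Hypothetical helper for illustration only; not a PdfPig or itextsharp API.
public static class FontMetricsSketch
{
    // Returns the bottom and top offsets of the box relative to the baseline.
    // ascent, descent, bboxTop and bboxBottom are font-descriptor values in glyph units.
    public static (double Bottom, double Top) VerticalExtent(
        double ascent, double descent, double bboxTop, double bboxBottom, double pointSize)
    {
        // Take the more extreme of the declared metric and the font bounding box.
        double maxAscent = Math.Max(bboxTop, ascent);
        double minDescent = Math.Min(bboxBottom, descent);
        double height = maxAscent - minDescent;

        // Assumption: treat degenerate metrics as "no correction".
        if (height <= 0)
        {
            return (0, 0);
        }

        // Rescale so that ascent minus descent spans exactly 1000 units.
        double scaledAscent = maxAscent * 1000 / height;
        double scaledDescent = minDescent * 1000 / height;

        // Convert from 1/1000 units to the text's point size.
        return (scaledDescent / 1000 * pointSize, scaledAscent / 1000 * pointSize);
    }
}
```

One consequence of the rescaling is that Top minus Bottom always equals pointSize, whatever the font metrics are. The metrics only decide how that height is split above and below the baseline. For example, `FontMetricsSketch.VerticalExtent(750, -250, 750, -250, 12)` gives a box from 3 units below the baseline to 9 units above it.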
@grinay Thanks a lot for the great explanation of your fix. I had a look on my side and I think there's an easier way to achieve what you want to do.
Can you try to compute the word bounding box using the following:
    foreach (var word in page.GetWords())
    {
        var first = word.Letters[0];
        var last = word.Letters[word.Letters.Count - 1];
        double x1 = first.GlyphRectangle.TopLeft.X;
        double x2 = last.GlyphRectangle.TopRight.X;
        double y1 = word.Letters.Max(l => l.GlyphRectangle.TopLeft.Y);
        double y2 = word.Letters.Min(l => l.GlyphRectangle.BottomLeft.Y);
        var bbox = new PdfRectangle(x1, y1, x2, y2); // This is the bbox you're looking for
        DrawRectangle(bbox, canvas, redPaint, size.Height, Scale);
    }
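The idea behind the loop above is a union of the per-glyph rectangles. The following is a self-contained sketch of that idea. It uses a hypothetical `Box` struct and `WordBoxSketch.Union` method invented for this example instead of PdfPig's `PdfRectangle`, and it assumes axis-aligned (non-rotated) text. Unlike the loop above, it takes the horizontal extent from all glyphs rather than only the first and last letter.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical axis-aligned rectangle; stands in for a glyph rectangle.
public readonly struct Box
{
    public Box(double left, double bottom, double right, double top)
    {
        Left = left;
        Bottom = bottom;
        Right = right;
        Top = top;
    }

    public double Left { get; }
    public double Bottom { get; }
    public double Right { get; }
    public double Top { get; }
}

// Hypothetical helper for illustration only; not a PdfPig API.
public static class WordBoxSketch
{
    // Smallest axis-aligned box that contains every glyph box.
    public static Box Union(IReadOnlyList<Box> glyphs)
    {
        if (glyphs == null || glyphs.Count == 0)
        {
            throw new ArgumentException("At least one glyph box is required.", nameof(glyphs));
        }

        return new Box(
            glyphs.Min(g => g.Left),
            glyphs.Min(g => g.Bottom), // goes below the baseline if any glyph has a descender
            glyphs.Max(g => g.Right),
            glyphs.Max(g => g.Top));
    }
}
```

With this approach the box height depends on which letters the word contains: a word with no ascenders or descenders gets a shorter box than one with both, whereas the font-descriptor approach gives every word in the same font and size the same height.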
I've pushed my code to https://github.com/BobLd/PdfPig/tree/issue-749-word-bbox (NB: this branch is not directly based on master and contains changes that are in #757).
Have a look in the test GenerateLetterGlyphImages, Issue749(). It will generate an image saved in the ImagesGlyphs folder:
You can see (above) that the "difficulty" bbox is now what you want. One difference between my code and yours might be the bbox of words that do not contain letters with ascenders and descenders, for example "The" and "in", in "The difficulty in".
Let's leave the issue open, as I'd like to do some further tests.
@BobLd yes, it looks good in the image. As I understand it, this will not work on the current master branch, right? Should we wait until you merge your changes? By the way, your image shows that for some words, like "program", the upper boundary of the bounding box is shifted down. That was the reason we had to extend the boxes by up to 20%, since in other documents it was very inaccurate for us. After I applied the fix with the font descriptor, this problem disappeared, and we removed the code that extended the boxes.
@grinay I think you can try my fix with the current PdfPig version, or with the latest pre-released version. But do let me know if that does not work.
As you point out, I do think some issues remain (especially the upper boundary problem), and your approach might be the solution. This is what I want to look into.
I am referring to the upper boundary in your initial screenshot:
The upper boundary is too high compared to what it should be, and that might be related to the ascender/descender. The lower boundary is correct though: we use the baseline points of the letters (i.e. not their bottom points, so any descender is excluded) as the bottom of the word bounding box, which is different from what you are looking for.
Let's leave the issue open for now so that we keep that in mind.