PdfPig
Wrong orientation of letter
I have the following problem: I have a 2D technical drawing where text is written in every direction.
The GetWords() method with NearestNeighbourWordExtractor works fine for me except for this example.
The image shows part of the PDF: light blue is the word box, dark blue is the letter box, and green is the Location.
My problem is that the letter has TextOrientation Horizontal, which leads to a wrongly drawn text box for it and may also be why the 7 and the 9 can't find each other with nearest neighbour.
I have tried to create a PDF that has the same problem, but I couldn't reproduce it.
Because of an NDA I can't share the file, but maybe you could point me in the right direction to find the problem and possibly a solution for it.
@muhmuhhum thanks for opening the issue. I understand you can't share the PDF document, but would you mind sharing the code you're using to draw the bounding box?
Also, is all the text always drawn as horizontal text? For example, do you also have the issue with the "R 12,5" text?
@BobLd Thanks for the quick response. It's wrong for some of the other letters in the document, but for nearly all of them the letter has the correct TextOrientation. I already found that Location.EndLine and Location.StartLine for the letters with the wrong TextOrientation are at the same point.
Here for the 7:
And the 2(Of the 25):
Here a bigger cutout with more words marked.
For the drawing I have to change some values, because Skia uses the top left as origin and I have to calculate the new position at 300 DPI:
var wordBox = GetRotatedRect(word.BoundingBox);
canvas.DrawRect(
    (float)(wordBox.blX / 72 * 300),
    (float)((page.Height - wordBox.blY - wordBox.height) / 72 * 300),
    (float)(wordBox.width / 72 * 300),
    (float)(wordBox.height / 72 * 300),
    wordPaint);
And GetRotatedRect:
(double blX, double blY, double width, double height) GetRotatedRect(PdfRectangle boundingBox)
{
    var xPoints = new List<double>
    {
        boundingBox.BottomLeft.X,
        boundingBox.TopLeft.X,
        boundingBox.TopRight.X,
        boundingBox.BottomRight.X
    };
    var yPoints = new List<double>
    {
        boundingBox.BottomLeft.Y,
        boundingBox.TopLeft.Y,
        boundingBox.TopRight.Y,
        boundingBox.BottomRight.Y
    };
    return (xPoints.Min(), yPoints.Min(), xPoints.Max() - xPoints.Min(), yPoints.Max() - yPoints.Min());
}
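The unit and axis conversion in the DrawRect call above could be factored into one small helper, which makes the Y flip easier to check in isolation. This is only a sketch: `PdfToPixel` and `ToPixelRect` are hypothetical names, not part of PdfPig or SkiaSharp, and it assumes the page is unrotated with its origin at the bottom left in PDF points (1/72 inch).

```csharp
public static class PdfToPixel
{
    // PDF user space uses 72 points per inch by default.
    public const double PointsPerInch = 72.0;

    // Converts an axis-aligned box given in PDF points (origin bottom left)
    // into pixel coordinates with a top-left origin, as Skia expects.
    public static (float X, float Y, float Width, float Height) ToPixelRect(
        double left, double bottom, double width, double height,
        double pageHeight, double dpi = 300)
    {
        double scale = dpi / PointsPerInch;

        // Flip the Y axis: the top edge of the box, measured from the top of the page.
        double top = pageHeight - bottom - height;

        return ((float)(left * scale), (float)(top * scale),
                (float)(width * scale), (float)(height * scale));
    }
}
```

With this helper the four arguments passed to DrawRect would come from a single call such as `PdfToPixel.ToPixelRect(wordBox.blX, wordBox.blY, wordBox.width, wordBox.height, page.Height)`.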
"I already found that the Location.EndLine and Location.StartLine for the letters with wrong TextOrientation are at the same point."
Thanks for that, that's very useful. I'll get back to you on that shortly.
Regarding the rendering with Skia, you do indeed need to invert the Y axis. I think one thing that causes your drawn bounding boxes to always be horizontal is that you use canvas.DrawRect(), which I think always draws axis-aligned rectangles.
Could you use the canvas.DrawPath() method instead? You can use the method below:
using (var rect = new SKPath())
{
    rect.MoveTo((float)transformedPdfBounds.BottomLeft.X, (float)transformedPdfBounds.BottomLeft.Y);
    rect.LineTo((float)transformedPdfBounds.TopLeft.X, (float)transformedPdfBounds.TopLeft.Y);
    rect.LineTo((float)transformedPdfBounds.TopRight.X, (float)transformedPdfBounds.TopRight.Y);
    rect.LineTo((float)transformedPdfBounds.BottomRight.X, (float)transformedPdfBounds.BottomRight.Y);
    rect.Close();
    _canvas.DrawPath(rect, new SKPaint() { Color = SKColors.Black, Style = SKPaintStyle.Stroke });
}
where transformedPdfBounds is your PdfRectangle boundingBox, with top left as origin (ready for Skia).
Oh, it is intended that the bounding boxes are always horizontal. Sorry, I missed this question in your original answer. That is what GetRotatedRect is for: it returns the horizontal box around the word. Sorry if that caused some confusion.
OK, after some research I think I found the problem. The PDF has fonts with widths of 0, which leads to some weird behavior.
@muhmuhhum sounds good, thanks a lot for that. The code that computes the text orientation is here https://github.com/UglyToad/PdfPig/blob/4537ec3f02c9f1f12e17e3a2e03f411c41d027de/src/UglyToad.PdfPig/Content/Letter.cs#L139C1-L162C10
If you want, you can try to fix it, I'll try to have a look on my side.
In the meantime, you can try using the NearestNeighbourWordExtractor while ignoring the TextOrientation, as follows:
var options = new NearestNeighbourWordExtractor.NearestNeighbourWordExtractorOptions()
{
    // Do not split letters into groups by TextOrientation before building words.
    GroupByOrientation = false
};
var nnWordExtractor = new NearestNeighbourWordExtractor(options);
Let me know if that helps
@BobLd Sorry for the late answer. My current workaround is that, when I extract the words, I check for letters where letter.StartBaseLine is the same point as letter.EndBaseLine, replace those points with the BottomLeft and BottomRight of the glyph box, and set the TextOrientation based on the rotation of the GlyphRectangle. This may ignore any extra width of the letters, but I haven't found a better solution. Now I just wonder how programs like Adobe Acrobat can draw this PDF, because as far as I understand a character with a width of 0 should be drawn as such, yet it is displayed normally, just like every other character.
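A minimal sketch of the workaround described above, written against stand-in types so it compiles without PdfPig. `Point`, `BaselineDirection`, and `BaselineRepair` are hypothetical names introduced for illustration. Mapping the result to PdfPig's TextOrientation values and rebuilding the Letter objects are left out, since those depend on the PdfPig version in use.

```csharp
using System;

// Hypothetical stand-in for a PDF point, so the sketch compiles without PdfPig.
public readonly record struct Point(double X, double Y);

// Hypothetical direction labels; mapping them to PdfPig's TextOrientation is up to the caller.
public enum BaselineDirection { Horizontal, UpsideDown, Upward, Downward, Other }

public static class BaselineRepair
{
    private const double Epsilon = 1e-5;

    // True when the baseline has collapsed to a single point (the zero-width case).
    public static bool IsDegenerate(Point start, Point end) =>
        Math.Abs(start.X - end.X) < Epsilon && Math.Abs(start.Y - end.Y) < Epsilon;

    // Falls back to the bottom edge of the glyph box when the baseline is degenerate.
    // As noted in the thread, this ignores any extra advance width of the letter.
    public static (Point Start, Point End) Repair(
        Point start, Point end, Point glyphBottomLeft, Point glyphBottomRight) =>
        IsDegenerate(start, end) ? (glyphBottomLeft, glyphBottomRight) : (start, end);

    // Classifies the direction of a baseline running from start to end.
    public static BaselineDirection Classify(Point start, Point end)
    {
        double dx = end.X - start.X;
        double dy = end.Y - start.Y;
        bool flatX = Math.Abs(dx) < Epsilon;
        bool flatY = Math.Abs(dy) < Epsilon;

        if (flatX && flatY) return BaselineDirection.Other; // still degenerate
        if (flatY) return dx > 0 ? BaselineDirection.Horizontal : BaselineDirection.UpsideDown;
        if (flatX) return dy > 0 ? BaselineDirection.Upward : BaselineDirection.Downward;
        return BaselineDirection.Other;
    }
}
```

In use, each letter's StartBaseLine and EndBaseLine would be passed through `Repair` together with the BottomLeft and BottomRight of its GlyphRectangle, and the orientation would be derived from the repaired pair before the letters are handed to the word extractor.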