Wrong text order in Pdf with NearestNeighbour
I have a PDF from a customer which, sorry, I can't share. All of the extracted text looks good except for one block, where the ordering of the text is wrong. This is the original text:
And this is the Text read by PdfPig
x211.75-6g
Now, if I take a look at the start and end baselines of the letters, it seems like it should work just fine?
I already tried changing the DistanceMeasure setting from Euclidean to Manhattan, but it did not help.
Here is also an example of the remaining fields of a letter. I already checked, and all letters have the same TextOrientation:
To clarify, I'm using the default options for NearestNeighbour.
Can you confirm the screenshot for the list of letters is what's inside the word object (and not the page)? Also, is the wrongly ordered text a single word? Or many words in a line?
This is very odd because as you said, the order should be correct based on the bounding boxes
@BobLd Yes, I can confirm that it is read as a single word with wrongly ordered text.
My workaround for now is to check whether all letters are Horizontal and share the same StartBaseLine.Y, and if so to reorder them by X. It seems to work for now:
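using System.Linq;
using UglyToad.PdfPig.Content;
using UglyToad.PdfPig.DocumentLayoutAnalysis.WordExtractor;

foreach (var word in page.GetWords(new NearestNeighbourWordExtractor()))
{
    // Only handle non-empty words whose letters are all horizontal.
    if (!string.IsNullOrWhiteSpace(word.Text) && (word.Letters?.All(x => x.TextOrientation == TextOrientation.Horizontal) ?? false))
    {
        var yCoordinate = word.Letters.First().StartBaseLine.Y;
        // Exact comparison: only reorder when every letter sits on the same baseline.
        if (word.Letters.All(x => x.StartBaseLine.Y == yCoordinate))
        {
            // Re-sort left to right by the start of each letter's baseline.
            var orderedLetters = word.Letters.OrderBy(x => x.StartBaseLine.X).ToList();
        }
    }
}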
In line with #1095, I'll close this since we can't really dig in without the document. The bug could be a floating point error or something else; it's impossible to tell without the document.
Hi, I am having the same issue, and I am not sure if this started happening recently or not, but words extracted (using the same nearest neighbor word extractor as the OP) have their letters in the wrong order.
I don't speak Spanish myself, but the letters are definitely out of order when looking at the sentence.
Here we see that the letters of the first word are out of order.
Attached is the example.pdf file
I can easily sort the letters manually for each line that comes out of the extractor, but the documents I process can be long, and I'd like to avoid the workaround. Any guidance on this? Perhaps I am using the library incorrectly.
To add to this, the issue does not happen with the DefaultWordExtractor
@jonathandlo I believe NearestNeighbour fails here because the letters' bounding boxes are messy.
For example this is the bounding box for the t only (from Acrobat reader):
The NearestNeighbour algo works by taking a given letter's bbox end baseline point (the bottom right point of the bbox) and looking for the bbox with the closest start baseline point (the bottom left point of the bbox). Because some letters' bboxes in this document overlap greatly, the output is messy. See more about NearestNeighbour here
I want to investigate whether there is a potential fix that would make extraction more robust, so I'll leave the issue open. In the meantime, unfortunately, you will need to sort the letters by start baseline point...
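To illustrate why overlapping boxes can break this kind of matching, here is a minimal, self-contained sketch. It is not PdfPig's implementation: `Glyph`, `NearestNeighbourSketch` and `Chain` are hypothetical names, letters are reduced to the X coordinates of their baseline start and end (all on the same line), and the linking is a simple greedy chain from the leftmost letter.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a letter: only the baseline X positions are kept.
public record Glyph(string Value, double StartX, double EndX);

public static class NearestNeighbourSketch
{
    // Greedy chain: begin at the leftmost start point, then repeatedly jump to
    // the unvisited glyph whose start point is closest to the current end point.
    public static string Chain(IReadOnlyList<Glyph> glyphs)
    {
        if (glyphs.Count == 0)
        {
            return string.Empty;
        }

        var remaining = glyphs.ToList();
        var current = remaining.OrderBy(g => g.StartX).First();
        remaining.Remove(current);
        var ordered = new List<Glyph> { current };

        while (remaining.Count > 0)
        {
            var end = current.EndX;
            var next = remaining.OrderBy(g => Math.Abs(g.StartX - end)).First();
            remaining.Remove(next);
            ordered.Add(next);
            current = next;
        }

        return string.Concat(ordered.Select(g => g.Value));
    }
}
```

With tidy boxes such as a (0 to 9), b (10 to 15), c (16 to 25), `Chain` returns "abc". If the box of "a" is too wide, say a (0 to 20), b (10 to 15), c (18 to 25), the start of "c" is closer to the end of "a" than the start of "b" is, and `Chain` returns "acb".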
Thanks for the very quick response. I see, yes, the files I process are sometimes messy like this. Since the text is only horizontal, I am able to use the default word extractor as a workaround for now, but I'll keep an eye out for any improvements to the extraction.
I often process strange PDF files: multilingual, with formatting issues, straight-up overlapping text, etc. If it's any help to you, I am happy to report extraction oddities I come across in the future.
And thank you for your work on this library. It's an oasis in the desert of C# options for PDF extraction.