PdfPig
Reading order
To show a bug in the reading order, I have extracted the problematic part from a document and created an example:
https://drive.google.com/file/d/17mU_P4hwUMDXXTeqIF1meC9VpS9bkznM/view?usp=sharing
The blocks are read in the following order:
- (i)
- Delivery i
- [Point form of the Services to be completed]
- [Point form of the deliverables to be provided]
- Methodology .... + all remaining text
- [EXAMPLE: Delivery of Features in accordance with the Delivery Plan.]
With the following code:
var words = page.GetWords(NearestNeighbourWordExtractor.Instance).ToList();
var blocks = DocstrumBoundingBoxes.Instance.GetBlocks(words);
var orderedBlocks = new UnsupervisedReadingOrderDetector(
        spatialReasoningRule: UnsupervisedReadingOrderDetector.SpatialReasoningRules.RowWise,
        useRenderingOrder: false)
    .Get(blocks)
    .ToList();
var finalBlocks = new List<UglyToad.PdfPig.DocumentLayoutAnalysis.TextBlock>(orderedBlocks);
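To make the order easier to inspect, the detected blocks can be printed together with their bounding boxes. This is only a minimal sketch: the file name `example.pdf` and the page number `1` are placeholders for the linked sample, and it assumes C# 9 top-level statements.

```csharp
using System;
using System.Linq;
using UglyToad.PdfPig;
using UglyToad.PdfPig.DocumentLayoutAnalysis.PageSegmenter;
using UglyToad.PdfPig.DocumentLayoutAnalysis.ReadingOrderDetector;
using UglyToad.PdfPig.DocumentLayoutAnalysis.WordExtractor;

// "example.pdf" and page 1 are placeholders for the linked sample document.
using (var document = PdfDocument.Open("example.pdf"))
{
    var page = document.GetPage(1);

    // Same pipeline as above: words -> blocks -> reading order.
    var words = page.GetWords(NearestNeighbourWordExtractor.Instance).ToList();
    var blocks = DocstrumBoundingBoxes.Instance.GetBlocks(words);
    var ordered = new UnsupervisedReadingOrderDetector(
            spatialReasoningRule: UnsupervisedReadingOrderDetector.SpatialReasoningRules.RowWise,
            useRenderingOrder: false)
        .Get(blocks);

    // Print each block's assigned reading order, bounding box and text.
    foreach (var block in ordered)
    {
        var box = block.BoundingBox;
        Console.WriteLine(
            $"{block.ReadingOrder}: [{box.Left:0.#}, {box.Bottom:0.#}, {box.Right:0.#}, {box.Top:0.#}] {block.Text}");
    }
}
```

Comparing the printed coordinates with the visual layout should show which blocks the detector places out of sequence.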
Looking at the document, there seems to be no reason for the algorithm to fail this badly.