
Reading order

Open christopher5106 opened this issue 4 years ago • 0 comments

To show a bug in the reading order, I have extracted the problematic part from a document and created an example:

https://drive.google.com/file/d/17mU_P4hwUMDXXTeqIF1meC9VpS9bkznM/view?usp=sharing

The blocks are read in this order:

  1. (i)
  2. Delivery i
  3. [Point form of the Services to be completed]
  4. [Point form of the deliverables to be provided]
  5. Methodology .... + all remaining text
  6. [EXAMPLE: Delivery of Features in accordance with the Delivery Plan.]

With the following code:

    var words = page.GetWords(NearestNeighbourWordExtractor.Instance).ToList();
    var blocks = DocstrumBoundingBoxes.Instance.GetBlocks(words);
    var orderedBlocks = new UnsupervisedReadingOrderDetector(
            spatialReasoningRule: UnsupervisedReadingOrderDetector.SpatialReasoningRules.RowWise,
            useRenderingOrder: false)
        .Get(blocks)
        .ToList();
    var finalBlocks = new List<UglyToad.PdfPig.DocumentLayoutAnalysis.TextBlock>(orderedBlocks);

Looking at the document, I see no obvious reason for the algorithm to get the order this wrong.
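To make the detected order easier to compare against the page layout, something like the following could print each block's reading order next to its bounding box. This is only a rough sketch: `example.pdf` is a placeholder path for the shared document, the namespaces are the ones I believe these classes live in and may differ between PdfPig versions, and the top-level statements assume C# 9 or later.

```csharp
using System;
using System.Linq;
using UglyToad.PdfPig;
using UglyToad.PdfPig.DocumentLayoutAnalysis.PageSegmenter;
using UglyToad.PdfPig.DocumentLayoutAnalysis.ReadingOrderDetector;
using UglyToad.PdfPig.DocumentLayoutAnalysis.WordExtractor;

// "example.pdf" is a placeholder; point it at the document linked above.
using (var document = PdfDocument.Open("example.pdf"))
{
    foreach (var page in document.GetPages())
    {
        // Same pipeline as in the report: words -> blocks -> reading order.
        var words = page.GetWords(NearestNeighbourWordExtractor.Instance).ToList();
        var blocks = DocstrumBoundingBoxes.Instance.GetBlocks(words);
        var detector = new UnsupervisedReadingOrderDetector(
            spatialReasoningRule: UnsupervisedReadingOrderDetector.SpatialReasoningRules.RowWise,
            useRenderingOrder: false);

        foreach (var block in detector.Get(blocks))
        {
            // PDF coordinates have their origin at the bottom-left,
            // so a larger Top value means the block sits higher on the page.
            var box = block.BoundingBox;
            var text = block.Text.Replace("\n", " ");
            Console.WriteLine(
                $"{block.ReadingOrder}: top={box.Top:F1} left={box.Left:F1} " +
                $"bottom={box.Bottom:F1} right={box.Right:F1} | {text}");
        }
    }
}
```

If the printed `top` values do not decrease roughly in step with the reading order, that would suggest the problem is in the ordering step and not in how Docstrum segments the blocks.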

christopher5106 · Feb 22 '21 11:02