PdfPig
UnsupervisedReadingOrder orders 2 blocks on the same row out of order
UnsupervisedReadingOrderDetector can return two blocks on the same row in the wrong order. It doesn't try to reorder blocks that share a row; it keeps the order in which the elements were passed in.
The problem seems to be caused here:
https://github.com/UglyToad/PdfPig/blob/d86c2f44f09ebb9fdf4fc09c16d9eb6ae5839f2c/src/UglyToad.PdfPig.DocumentLayoutAnalysis/ReadingOrderDetector/UnsupervisedReadingOrderDetector.cs#L273
https://github.com/UglyToad/PdfPig/blob/d86c2f44f09ebb9fdf4fc09c16d9eb6ae5839f2c/src/UglyToad.PdfPig.DocumentLayoutAnalysis/ReadingOrderDetector/UnsupervisedReadingOrderDetector.cs#L357
Because of the order of the if statements, the code always selects IntervalRelations.PrecedesI or IntervalRelations.Precedes and never reaches the other cases.
I can raise a PR with a fix if this is the case, but I wanted to check with the experts first (I'd prefer it if you did it, as the change could have a big impact).
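To illustrate the kind of shadowing described above, here is a minimal sketch. It is not copied from PdfPig: the `Relation` enum, the `IntervalSketch` class, and both method names are hypothetical, and the conditions are assumptions chosen only to show how two complementary checks at the top of an if/else chain make every later branch unreachable.

```csharp
using System;

// Hypothetical, simplified stand-in for an interval-relation enum; not PdfPig's type.
public enum Relation { Precedes, PrecedesI, Overlaps, Equals }

public static class IntervalSketch
{
    // Broad checks first. The first two conditions are complementary for any
    // non-NaN input, so the remaining branches can never be taken.
    public static Relation ClassifyShadowed(double aStart, double aEnd, double bStart, double bEnd, double t)
    {
        if (aEnd < bStart - t)
        {
            return Relation.Precedes;
        }
        else if (aEnd >= bStart - t)
        {
            return Relation.PrecedesI; // catches everything the first check missed
        }
        else if (Math.Abs(aStart - bStart) <= t && Math.Abs(aEnd - bEnd) <= t)
        {
            return Relation.Equals; // never reached for non-NaN input
        }

        return Relation.Overlaps; // never reached for non-NaN input
    }

    // Specific checks first, broad fallback last.
    public static Relation ClassifyOrdered(double aStart, double aEnd, double bStart, double bEnd, double t)
    {
        if (Math.Abs(aStart - bStart) <= t && Math.Abs(aEnd - bEnd) <= t)
        {
            return Relation.Equals;
        }

        if (aEnd < bStart - t)
        {
            return Relation.Precedes;
        }

        if (bEnd < aStart - t)
        {
            return Relation.PrecedesI;
        }

        return Relation.Overlaps;
    }
}
```

With two identical intervals, such as the vertical extents (0 to 10) of the two blocks in the test below and a tolerance of 5, `ClassifyShadowed` reports `PrecedesI` while `ClassifyOrdered` reports `Equals`. That is the kind of same-row case that gets lost when the broad checks come first.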
Minimal test that reproduces the scenario:
using System.Collections.Generic;
using System.Linq;
using NUnit.Framework;
using UglyToad.PdfPig.Content;
using UglyToad.PdfPig.Core;
using UglyToad.PdfPig.DocumentLayoutAnalysis;
using UglyToad.PdfPig.DocumentLayoutAnalysis.ReadingOrderDetector;

namespace ReadingOrderDetectorTests
{
    public class ReadingOrderDetectorTest
    {
        [Test]
        public void ReadingOrderDoesNotOrderRowContents()
        {
            // Left block: a single letter at x = 0..10
            var letterA = new Letter("a",
                new PdfRectangle(new PdfPoint(0, 0), new PdfPoint(10, 10)),
                new PdfPoint(0, 0),
                new PdfPoint(10, 0),
                10, 1, null, TextRenderingMode.NeitherClip, null, null, 0, 0); // These don't matter
            var leftTextBlock = new TextBlock(new[] { new TextLine(new[] { new Word(new[] { letterA }) }) });

            // Right block: a single letter at x = 100..110, on the same row
            var letterB = new Letter("b",
                new PdfRectangle(new PdfPoint(100, 0), new PdfPoint(110, 10)),
                new PdfPoint(100, 0),
                new PdfPoint(110, 0),
                10, 1, null, TextRenderingMode.NeitherClip, null, null, 0, 0); // These don't matter
            var rightTextBlock = new TextBlock(new[] { new TextLine(new[] { new Word(new[] { letterB }) }) });

            // We deliberately submit in the wrong order
            var textBlocks = new List<TextBlock>() { rightTextBlock, leftTextBlock };

            var unsupervisedReadingOrderDetector = new UnsupervisedReadingOrderDetector(5, UnsupervisedReadingOrderDetector.SpatialReasoningRules.RowWise);
            var orderedBlocks = unsupervisedReadingOrderDetector.Get(textBlocks);
            var ordered = orderedBlocks.OrderBy(x => x.ReadingOrder).ToList();

            // Expected: the left block comes first, then the right block
            Assert.That(ordered[0].BoundingBox.Left, Is.EqualTo(0));
            Assert.That(ordered[1].BoundingBox.Left, Is.EqualTo(100));
        }
    }
}
Hi @davebrokit thanks a lot for that, I haven't checked but what you say seems to make sense. Happy for you to create a PR, I'll review it. If you can add tests, that'd be amazing. Thx!
PR raised