
UnsupervisedReadingOrder orders 2 blocks on the same row out of order

Open davebrokit opened this issue 1 year ago • 1 comments

UnsupervisedReadingOrderDetector can return two blocks that sit on the same row in the wrong order. It does not try to reorder blocks that share a row, so they keep the order in which they were passed in.

The problem seems to be caused here:

https://github.com/UglyToad/PdfPig/blob/d86c2f44f09ebb9fdf4fc09c16d9eb6ae5839f2c/src/UglyToad.PdfPig.DocumentLayoutAnalysis/ReadingOrderDetector/UnsupervisedReadingOrderDetector.cs#L273

https://github.com/UglyToad/PdfPig/blob/d86c2f44f09ebb9fdf4fc09c16d9eb6ae5839f2c/src/UglyToad.PdfPig.DocumentLayoutAnalysis/ReadingOrderDetector/UnsupervisedReadingOrderDetector.cs#L357

Because of the order of the if statements, the code always selects IntervalRelations.PrecedesI or IntervalRelations.Precedes and never reaches the other cases.
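To illustrate the kind of branch shadowing described above, here is a minimal, self-contained sketch. It is not PdfPig's actual code: the `Relation` enum, the `BranchOrderSketch` class, and both method names are hypothetical stand-ins, and the comparison is reduced to a single axis with a tolerance. It only shows how an `else if` that is the logical complement of the first test makes every later branch unreachable.

```csharp
using System;

// Hypothetical enum for illustration only; NOT PdfPig's IntervalRelations.
public enum Relation { Precedes, Meets, Other }

public static class BranchOrderSketch
{
    // Shadowed ordering: the second condition is the exact complement of the
    // first, so for any non-NaN input one of the first two branches returns
    // and the Meets branch can never run.
    public static Relation Shadowed(double aRight, double bLeft, double tolerance)
    {
        if (aRight < bLeft - tolerance)
        {
            return Relation.Precedes;
        }
        else if (aRight >= bLeft - tolerance)
        {
            return Relation.Other;
        }
        else if (Math.Abs(aRight - bLeft) <= tolerance)
        {
            return Relation.Meets; // unreachable for non-NaN input
        }

        return Relation.Other;
    }

    // Reordered: the narrower Meets test comes before the catch-all branch,
    // so it gets a chance to match.
    public static Relation Reordered(double aRight, double bLeft, double tolerance)
    {
        if (aRight < bLeft - tolerance)
        {
            return Relation.Precedes;
        }
        else if (Math.Abs(aRight - bLeft) <= tolerance)
        {
            return Relation.Meets;
        }

        return Relation.Other;
    }
}
```

With `aRight = 10`, `bLeft = 12`, and `tolerance = 5`, `Shadowed` returns `Relation.Other` while `Reordered` returns `Relation.Meets`. Whether the real fix is a simple reordering like this is an assumption; the linked lines are the place to confirm it.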

I can raise a PR with a fix if this is the case, but I wanted to check with the experts first. I would prefer that you made the change, since it potentially has a big impact.

Minimal test that reproduces the scenario:

using System.Collections.Generic; // List<T>
using System.Linq;                // OrderBy, ToList
using NUnit.Framework;
using UglyToad.PdfPig.Content;
using UglyToad.PdfPig.DocumentLayoutAnalysis.ReadingOrderDetector;
using UglyToad.PdfPig.DocumentLayoutAnalysis;
using UglyToad.PdfPig.Core;

namespace ReadingOrderDectorTests
{
    public class ReadingOrderDectorTest
    {
        [Test]
        public void ReadingOrderDoesNotOrderRowContents()
        {
            var letterA = new Letter("a",
                new PdfRectangle(new PdfPoint(0, 0), new PdfPoint(10, 10)),
                new PdfPoint(0, 0),
                new PdfPoint(10, 0),
                10, 1, null, TextRenderingMode.NeitherClip, null, null, 0, 0);// These don't matter
            var leftTextBlock = new TextBlock(new[] { new TextLine(new[] { new Word(new[] { letterA }) }) });

            var letterB = new Letter("b",
                new PdfRectangle(new PdfPoint(100, 0), new PdfPoint(110, 10)),
                new PdfPoint(100, 0),
                new PdfPoint(110, 0),
                10, 1, null, TextRenderingMode.NeitherClip, null, null, 0, 0);// These don't matter
            var rightTextBlock = new TextBlock(new[] { new TextLine(new[] { new Word(new[] { letterB }) }) });

            // We deliberately submit in the wrong order
            var textBlocks = new List<TextBlock>() { rightTextBlock, leftTextBlock };

            var unsupervisedReadingOrderDetector = new UnsupervisedReadingOrderDetector(5, UnsupervisedReadingOrderDetector.SpatialReasoningRules.RowWise);
            var orderedBlocks = unsupervisedReadingOrderDetector.Get(textBlocks);

            var ordered = orderedBlocks.OrderBy(x => x.ReadingOrder).ToList();
            Assert.That(ordered[0].BoundingBox.Left, Is.EqualTo(0));
            Assert.That(ordered[1].BoundingBox.Left, Is.EqualTo(100));
        }
    }
}

davebrokit, May 20 '24 19:05

Hi @davebrokit, thanks a lot for that. I haven't checked, but what you say seems to make sense. Happy for you to create a PR, and I'll review it. If you can add tests, that'd be amazing. Thanks!

BobLd, May 20 '24 20:05

PR raised

davebrokit, May 27 '24 18:05