PdfPig icon indicating copy to clipboard operation
PdfPig copied to clipboard

Introduce IBlock and ILettersBlock interfaces

Open davebrokit opened this issue 1 year ago • 3 comments

Part of #855

davebrokit avatar Jun 30 '24 07:06 davebrokit

@BobLd are we waiting on any further changes for this PR?

EliotJones avatar Sep 29 '24 15:09 EliotJones

@EliotJones I'd like to leave this PR open as the point addressed here are relevant. I won't have time short term to focus on this part of the library as i'd like to focus on bug fixes and implementing missing parts of the pdf specifications

BobLd avatar Sep 29 '24 17:09 BobLd

Hi @EliotJones @BobLd

It'll be great to get some progress on this PR. I've responded to your comments @BobLd on the changes you requested:

Done:

  • Split my big PR into 3 PRs
  • change to IBoundingBox
  • Exclude PageXmlGeneralExporter from PR

Todo (was waiting for response to my comment before making the change):

  • Remove IBoundingBox interface from letter Letter
  • Remove SetReadingOrder
  • Change Letters property to lazy collection (depending on response to comment)

The contentious issue is should Word, TextLine, TextBlock implement the new ILettersBlock interface and the reading orders use the IBoundingBox in a future PR. I've made my arguments why I think they should and I'm waiting on a response. I can then make the necessary changes and hopefully close this PR after that.

It'll be great to get a response as this is functionality I'm using and the PR has been hanging around for 3 months 😞

I also have 2 follow on PRs that are dependent on this: One to address the line segementer issue raised in #844 and one to change the Reading order detectors that finish off this issue #855. The code has been written for them, but I haven't raised them yet as changes may need to be made following the outcome of this discussion.

davebrokit avatar Sep 30 '24 08:09 davebrokit

Thanks for your patience here @davebrokit. I'd suggest the following changes to move this forward (apologies if there is some point here already addressed in the discussion, I haven't read the full thing):

  • Keep the Bounds property on the image classes so this isn't a major breaking API change, you can have both BoundingBox and Bounds where BoundingBox just returns Bounds, or is potentially even an explicit interface property, e.g. IBoundBox.BoundingBox => Bounds, or whatever the exact syntax is.
  • I think I'd lean to not exposing the Letters collection for now, I'd rather that complexity and cost lives in the specific reading order detectors than every consumer has to pay the performance cost of evaluating these collections. I'm wondering if we keep the ILettersBlock interface but the reading order detectors have their own wrapper classes for related classes. However since TextLine and TextBlock live in document layout analysis namespace I can't remember what consumers cause these types to be constructed. But since it's a contentious change I'd rather we isolate it to the last-possible consumer, it's better to get this in and refine than have it stuck in this state, so this feels like a compromise no one is happy with, the best kind.

EliotJones avatar Jul 20 '25 16:07 EliotJones

Sounds good. Will implement when I get some time

davebrokit avatar Jul 20 '25 20:07 davebrokit