PdfPig
PdfPig copied to clipboard
Introduce IBlock and ILettersBlock interfaces
Part of #855
@BobLd are we waiting on any further changes for this PR?
@EliotJones I'd like to leave this PR open as the point addressed here are relevant. I won't have time short term to focus on this part of the library as i'd like to focus on bug fixes and implementing missing parts of the pdf specifications
Hi @EliotJones @BobLd
It'll be great to get some progress on this PR. I've responded to your comments @BobLd on the changes you requested:
Done:
- Split my big PR into 3 PRs
- change to
IBoundingBox - Exclude
PageXmlGeneralExporterfrom PR
Todo (was waiting for response to my comment before making the change):
- Remove
IBoundingBoxinterface from letterLetter - Remove
SetReadingOrder - Change
Lettersproperty to lazy collection (depending on response to comment)
The contentious issue is should Word, TextLine, TextBlock implement the new ILettersBlock interface and the reading orders use the IBoundingBox in a future PR. I've made my arguments why I think they should and I'm waiting on a response. I can then make the necessary changes and hopefully close this PR after that.
It'll be great to get a response as this is functionality I'm using and the PR has been hanging around for 3 months 😞
I also have 2 follow on PRs that are dependent on this: One to address the line segementer issue raised in #844 and one to change the Reading order detectors that finish off this issue #855. The code has been written for them, but I haven't raised them yet as changes may need to be made following the outcome of this discussion.
Thanks for your patience here @davebrokit. I'd suggest the following changes to move this forward (apologies if there is some point here already addressed in the discussion, I haven't read the full thing):
- Keep the
Boundsproperty on the image classes so this isn't a major breaking API change, you can have bothBoundingBoxandBoundswhereBoundingBoxjust returnsBounds, or is potentially even an explicit interface property, e.g.IBoundBox.BoundingBox => Bounds, or whatever the exact syntax is. - I think I'd lean to not exposing the
Letterscollection for now, I'd rather that complexity and cost lives in the specific reading order detectors than every consumer has to pay the performance cost of evaluating these collections. I'm wondering if we keep theILettersBlockinterface but the reading order detectors have their ownwrapperclasses for related classes. However sinceTextLineandTextBlocklive in document layout analysis namespace I can't remember what consumers cause these types to be constructed. But since it's a contentious change I'd rather we isolate it to the last-possible consumer, it's better to get this in and refine than have it stuck in this state, so this feels like a compromise no one is happy with, the best kind.
Sounds good. Will implement when I get some time