PdfPig
Enforce line breaks at N characters on text extraction
As mentioned in https://github.com/UglyToad/PdfPig/issues/441#issuecomment-1108396669, cross-platform PDF generation would be much easier to verify if PdfPig could automatically break extracted text at a configurable number of characters. Something like this:
new ContentOrderTextExtractor.Options
{
    ReplaceWhitespaceWithSpace = true,
    BreakLinesAt = 120
}
ContentOrderTextExtractor.Options.BreakLinesAt could be an int? whose default value of null means line breaks are not enforced, but are instead kept exactly as extracted from the PDF.
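The proposed behaviour can also be approximated by post-processing the extracted string. The following is only a minimal sketch of that idea: `TextWrapper` and its `BreakLinesAt` method are hypothetical names, not part of PdfPig, and the sketch assumes that collapsing all whitespace before re-wrapping is acceptable for the comparison.

```csharp
using System;
using System.Text;

// Hypothetical helper, NOT part of PdfPig.
public static class TextWrapper
{
    // Collapses every run of whitespace (including existing line breaks), then
    // greedily re-wraps the words so that no line exceeds maxWidth characters.
    // A single word longer than maxWidth is kept whole on its own line.
    public static string BreakLinesAt(string text, int? maxWidth)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        if (maxWidth == null) return text; // null: keep line breaks as extracted
        if (maxWidth <= 0) throw new ArgumentOutOfRangeException(nameof(maxWidth));

        // A null separator array makes Split treat any whitespace as a delimiter.
        var words = text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        var result = new StringBuilder();
        var lineLength = 0;

        foreach (var word in words)
        {
            // Start a new line if the next word (plus a space) would not fit.
            if (lineLength > 0 && lineLength + 1 + word.Length > maxWidth.Value)
            {
                result.Append('\n');
                lineLength = 0;
            }

            if (lineLength > 0)
            {
                result.Append(' ');
                lineLength++;
            }

            result.Append(word);
            lineLength += word.Length;
        }

        return result.ToString();
    }
}
```

The helper would be applied to whatever string the extractor returns, for example the result of ContentOrderTextExtractor.GetText(page), before comparing it with the expected text.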
Hi @asbjornu, unfortunately this change lies outside the scope of the text extractor. When I next get some time and motivation, I'll look into why the Mac and Windows extraction differs. Since the positions will be more or less identical, it is probably due to a bug in position calculation, perhaps to do with the assumed width of the tab character appearing on Mac.
Thanks for the response, @EliotJones. If you take a look at the PDFs I sent you, I believe the text is actually in different places in the source PDF as well, so this is not a fault of the extractor. I just hoped the extractor would be able to iron out the differences I'm experiencing between Mac and Windows, because the differences are of no importance to the validity of the generated documents – which is what I'm trying to assert.
@asbjornu I'm going through and closing old issues. On this issue in particular, it may be more resilient to use one of the layout analysis tools (https://github.com/UglyToad/PdfPig/wiki/Document-Layout-Analysis) to generate the verified content, so that no variability is introduced by the file producer. For example, use NearestNeighbourWordExtractor followed by DocstrumBoundingBoxes to segment the file into comparable chunks.
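A rough sketch of that suggestion is below. It assumes the PdfPig NuGet package is referenced. `LayoutChunks`, `Extract` and `CollapseWhitespace` are hypothetical names introduced for illustration. Collapsing whitespace inside each block is an extra assumption here, not something the layout analysis tools do themselves.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using UglyToad.PdfPig;
using UglyToad.PdfPig.DocumentLayoutAnalysis.PageSegmenter;
using UglyToad.PdfPig.DocumentLayoutAnalysis.WordExtractor;

// Hypothetical helper for illustration; only the PdfPig calls come from the library.
public static class LayoutChunks
{
    // Replaces every run of whitespace with a single space and trims the ends,
    // so that line breaks inside a block do not affect the comparison.
    public static string CollapseWhitespace(string text) =>
        Regex.Replace(text, @"\s+", " ").Trim();

    // Returns one normalized string per text block found by Docstrum, page by page.
    public static List<string> Extract(string pdfPath)
    {
        var chunks = new List<string>();

        using (var document = PdfDocument.Open(pdfPath))
        {
            foreach (var page in document.GetPages())
            {
                // Group letters into words, then words into blocks of text.
                var words = page.GetWords(NearestNeighbourWordExtractor.Instance);
                var blocks = DocstrumBoundingBoxes.Instance.GetBlocks(words);

                chunks.AddRange(blocks.Select(b => CollapseWhitespace(b.Text)));
            }
        }

        return chunks;
    }
}
```

The chunk lists from the Mac and Windows PDFs could then be compared with Enumerable.SequenceEqual. This assumes both producers yield the same blocks in the same order, which would need checking against the real files.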
Thanks for the pointers and suggestions, @EliotJones!