PdfPig Enforce line breaks at N characters on text extraction

Open asbjornu opened this issue 3 years ago • 2 comments

As mentioned in https://github.com/UglyToad/PdfPig/issues/441#issuecomment-1108396669, it would make cross-platform PDF generation much easier to verify if PdfPig were able to automatically break extracted text at a configurable number of characters. Something like this:

var options = new ContentOrderTextExtractor.Options
{
    ReplaceWhitespaceWithSpace = true,
    // Proposed option; it does not exist in PdfPig today.
    BreakLinesAt = 120
};

ContentOrderTextExtractor.Options.BreakLinesAt could be an int? whose default value of null would mean that line breaks aren't enforced and the text is extracted from the PDF as is.
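
As a rough sketch of the requested behavior, the same effect could probably be approximated today as post-processing on the extracted string. The `LineBreaker` class and its `BreakLinesAt` method below are hypothetical names, not part of PdfPig; the sketch assumes that original line breaks may be discarded and that words longer than the limit may be split.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class LineBreaker
{
    // Hypothetical helper, not a PdfPig API: re-wraps already extracted text so
    // that no line is longer than maxLength characters, breaking at whitespace.
    public static string BreakLinesAt(string text, int? maxLength)
    {
        // null mirrors the proposed default: return the text as extracted.
        if (text == null || maxLength == null) return text;
        if (maxLength.Value < 1) throw new ArgumentOutOfRangeException(nameof(maxLength));
        int width = maxLength.Value;

        // Splitting on a null separator splits on any whitespace, so the
        // original line breaks are treated like ordinary word boundaries.
        var words = text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        var lines = new List<string>();
        var current = new StringBuilder();

        foreach (var word in words)
        {
            var rest = word;

            // A single word longer than the limit is split into fixed-size pieces.
            while (rest.Length > width)
            {
                if (current.Length > 0) { lines.Add(current.ToString()); current.Clear(); }
                lines.Add(rest.Substring(0, width));
                rest = rest.Substring(width);
            }

            // Start a new line when the word (plus a separating space) would not fit.
            if (current.Length > 0 && current.Length + 1 + rest.Length > width)
            {
                lines.Add(current.ToString());
                current.Clear();
            }

            if (current.Length > 0) current.Append(' ');
            current.Append(rest);
        }

        if (current.Length > 0) lines.Add(current.ToString());
        return string.Join("\n", lines);
    }
}
```

Calling `LineBreaker.BreakLinesAt(extractedText, 120)` on the text from both platforms before comparing would remove differences that come only from where lines happen to break, though not differences in word order or content.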

May 04 '22 21:05 asbjornu

Hi @asbjornu, unfortunately this change lies outside the scope of the text extractor. When I next get some time and motivation I'll look into why the Mac and Windows extraction differs. Since the positions will be more or less identical, it is probably due to a bug in position calculation, perhaps to do with the assumed width of the tab character appearing on Mac.

May 29 '22 20:05 EliotJones

Thanks for the response, @EliotJones. If you take a look at the PDFs I sent you, I believe the text is actually in different places in the source PDF as well, so this is not a fault of the extractor. I just hoped the extractor would be able to iron out the differences I'm experiencing between Mac and Windows, because the differences are of no importance to the validity of the generated documents – which is what I'm trying to assert.

May 30 '22 20:05 asbjornu

@asbjornu I'm going through closing old issues. On this issue in particular, it may be more resilient to use one of the layout analysis tools (https://github.com/UglyToad/PdfPig/wiki/Document-Layout-Analysis) to generate the verified content, so that no variability is introduced based on the file producer. For example, you could use NearestNeighbourWordExtractor followed by DocstrumBoundingBoxes to segment the file into comparable chunks.
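
A minimal sketch of that suggestion, assuming the `Instance` singletons and `GetBlocks` call shown on the layout analysis wiki page (where the word extractor class is spelled `NearestNeighbourWordExtractor`). The `ComparableChunks` class, its method names, and the whitespace normalization step are illustrative assumptions, not part of PdfPig.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using UglyToad.PdfPig;
using UglyToad.PdfPig.DocumentLayoutAnalysis.PageSegmenter;
using UglyToad.PdfPig.DocumentLayoutAnalysis.WordExtractor;

public static class ComparableChunks
{
    // Hypothetical helper: collapses every run of whitespace to a single space
    // so that line-break and tab differences do not affect the comparison.
    public static string Normalize(string text) =>
        Regex.Replace(text, @"\s+", " ").Trim();

    // Hypothetical helper: returns the normalized text of each block on each page.
    public static List<string> Extract(string path)
    {
        var chunks = new List<string>();

        using (var document = PdfDocument.Open(path))
        {
            foreach (var page in document.GetPages())
            {
                // Group letters into words by proximity, then words into blocks.
                var words = page.GetWords(NearestNeighbourWordExtractor.Instance);
                var blocks = DocstrumBoundingBoxes.Instance.GetBlocks(words);

                chunks.AddRange(blocks.Select(block => Normalize(block.Text)));
            }
        }

        return chunks;
    }
}
```

The lists produced from the Mac and Windows files could then be compared element by element. Block order is not guaranteed to match between producers, so sorting the chunks or applying a reading order detector first may be needed.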

Dec 11 '22 20:12 EliotJones

Thanks for the pointers and suggestions, @EliotJones!

Dec 11 '22 21:12 asbjornu
