PDFTextStripper class

Open ivanlabsii opened this issue 5 years ago • 8 comments

First, let me say that I was (and still am) very happy to see that there is a port of PDFBox to .NET.

I understand that it is a work in progress, but I would like to stress the need to port the PDFTextStripper class, which doesn't appear to be ported at the moment. For me, and probably for many other developers, it is the only reason I use PDFBox at all.

Hopefully it will be implemented in the future; I'm looking forward to it. If possible, please keep this issue open until it is implemented, so that everyone interested in this feature can easily be notified of the change.

ivanlabsii avatar Jun 09 '20 11:06 ivanlabsii

Hi @ivanicin, thanks for the feedback! Can you give an example of how you'd use PDFTextStripper? Is it to extract text? If so, what would be the difference from PdfPig's ContentOrderTextExtractor?

BobLd avatar Jun 10 '20 09:06 BobLd

> Hi @ivanicin, thanks for the feedback! Can you give an example of how you'd use PDFTextStripper? Is it to extract text? If so, what would be the difference from PdfPig's ContentOrderTextExtractor?

This seems to be it, though I couldn't make it work; that's another issue, which I'll file separately.

Currently this is only in the pre-release NuGet package, so I'm not sure whether you close an issue once the fix reaches a regular release or as soon as some sort of solution exists. I'll leave it to the owners to decide whether to close this issue immediately.

ivanlabsii avatar Jun 10 '20 10:06 ivanlabsii

Also, I am not sure whether this should be the same issue or a new one, but when I try to parse this file: https://s3-us-west-2.amazonaws.com/pressbooks-samplefiles/LewisTheme/The-Problems-of-Philosophy-LewisTheme.pdf the results are far from PDFTextStripper quality. At first glance I can't see any problems with the PDFTextStripper output, while ContentOrderTextExtractor has significant problems: missing spaces where they are expected, new lines where they are not expected, and some words doubled. These appear in many places in the text, so they are quite easy to spot.

Note that this document isn't particularly complex, so there are probably even worse problems with more complex documents.

So I would leave this issue open until ContentOrderTextExtractor reaches similar quality to PDFTextStripper on this sample document (which should be feasible, as PDFTextStripper is part of the PDFBox code you are porting). Alternatively, we can close this issue and I can file a new one if that is preferred.

ivanlabsii avatar Jun 10 '20 11:06 ivanlabsii

I think this document is a very interesting case indeed, for example for the doubled words (which I guess is related to faking bold by drawing the letters twice), and this should be handled by default in the text extractor. I'll let @EliotJones decide, but I think this is a good issue 😃

BobLd avatar Jun 10 '20 14:06 BobLd

For reference on the duplicate letters, this is how PDFTextStripper seems to handle it:

https://github.com/apache/pdfbox/blob/64add684c6b8d9845377f31c619a107018e05f31/pdfbox/src/main/java/org/apache/pdfbox/text/PDFTextStripper.java#L771-L819
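The general idea of suppressing such duplicates can be sketched in a few lines of Java. This is a minimal, self-contained illustration and not the PDFBox code itself: the `Glyph` type, the `suppressOverlaps` and `toText` names, and the one-third-of-glyph-width tolerance are assumptions made for the example. It drops a glyph when the same character has already been seen at nearly the same position.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DuplicateGlyphFilter {

    /** Hypothetical stand-in for a positioned glyph; not a PDFBox or PdfPig type. */
    public static final class Glyph {
        public final String text;
        public final float x;
        public final float y;
        public final float width;

        public Glyph(String text, float x, float y, float width) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    /**
     * Keeps the first occurrence of each glyph and drops later glyphs that carry
     * the same text within a small distance of one already kept.
     * The tolerance (a third of the glyph width) is an assumption for this sketch.
     */
    public static List<Glyph> suppressOverlaps(List<Glyph> glyphs) {
        Map<String, List<Glyph>> keptByText = new HashMap<>();
        List<Glyph> kept = new ArrayList<>();
        for (Glyph g : glyphs) {
            float tolerance = g.width / 3.0f;
            List<Glyph> sameText = keptByText.computeIfAbsent(g.text, k -> new ArrayList<>());
            boolean duplicate = false;
            for (Glyph prev : sameText) {
                if (Math.abs(prev.x - g.x) < tolerance && Math.abs(prev.y - g.y) < tolerance) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                sameText.add(g);
                kept.add(g);
            }
        }
        return kept;
    }

    /** Concatenates the glyph texts in list order. */
    public static String toText(List<Glyph> glyphs) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            sb.append(g.text);
        }
        return sb.toString();
    }
}
```

A fake-bold word drawn twice with a small horizontal offset would then collapse back to a single copy, while a genuinely repeated letter further along the line is kept because it lies outside the tolerance.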

BobLd avatar Jun 11 '20 15:06 BobLd

I'm marking this as help wanted since I won't have time to work on this for the next few months at least.

For anyone interested in picking this up: as @BobLd linked, the source for PDFBox's PDFTextStripper is here: https://github.com/apache/pdfbox/blob/64add684c6b8d9845377f31c619a107018e05f31/pdfbox/src/main/java/org/apache/pdfbox/text/PDFTextStripper.java

This won't map 1:1 to PdfPig concepts, so it will need a bit of rework to fit into the PdfPig API. I'd suggest creating the class in the UglyToad.PdfPig.DocumentLayoutAnalysis.TextExtractor project and namespace. We're probably fine keeping a static class with a single public static method, GetText(PdfDocument document), which re-uses most of the existing PDFBox logic.

EliotJones avatar Jun 20 '20 11:06 EliotJones

If someone ports this so that it gives the same results as PDFBox on the document above (and possibly a few more that I may check), I'll provide a small incentive, something like $100. To make sure the offer is still valid, just post here before you start. I'll reply to confirm that I haven't found an alternative solution, and then I'll post a project on Freelancer with the deposit there.

ivanlabsii avatar Jun 20 '20 18:06 ivanlabsii

Happy to help! Maybe it will be easier to discuss this on PdfPig's Gitter.

BobLd avatar Jun 21 '20 11:06 BobLd