Feature: Outputting an annotation and the entire sentence where the annotation is located

Open Shadowalker1995 opened this issue 1 year ago • 5 comments

Thank you so much for developing this module, it's fantastic. Would it be possible to output an annotation together with the entire sentence in which it is located?

If so, please point me to the general approach. Thanks.

Shadowalker1995 avatar Aug 30 '23 04:08 Shadowalker1995

Capturing text before/after an annotation is implemented in the code as "context", but is currently used only for strikeout annotations. My expectation was that anyone adding a comment on a specific sentence would use highlight annotations, where the highlight covers the text you want to include with the annotation.

If you did want to include context with other annotation types, you could probably modify Annotation.wants_context to capture context for those types where you need it, then implement some heuristics for deciding where the sentence boundaries lie -- the current algorithm for this is implemented by trim_context in the markdown printer.
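To illustrate the kind of sentence-boundary heuristic meant here, below is a minimal sketch. It is not pdfannots' trim_context: `expand_to_sentence` and its three arguments (the text before the annotation, the annotated text, and the text after it) are hypothetical names chosen for this example, and the boundary rule (a `.`, `!` or `?` followed by whitespace) is deliberately naive.

```python
import re

# Naive boundary: '.', '!' or '?' followed by whitespace. Abbreviations such
# as "e.g." or "Fig. 3" will be split in the wrong place.
_BOUNDARY_BEFORE = re.compile(r'[.!?]\s+')
_BOUNDARY_AFTER = re.compile(r'[.!?](?=\s|$)')


def expand_to_sentence(before: str, selected: str, after: str) -> str:
    """Hypothetical helper: grow the annotated text to its enclosing sentence."""
    # The sentence starts just after the last boundary in the preceding text.
    start = 0
    for match in _BOUNDARY_BEFORE.finditer(before):
        start = match.end()

    # The sentence ends at the first terminator in the following text,
    # or at the end of the available context if there is none.
    match = _BOUNDARY_AFTER.search(after)
    end = match.end() if match else len(after)

    return (before[start:] + selected + after[:end]).strip()


if __name__ == "__main__":
    print(expand_to_sentence(
        "We tried A. The results shed ",
        "light on",
        " the problem. More work is needed.",
    ))
```

A rule this simple shows both the idea and the difficulty: any period inside an abbreviation, decimal number, or citation counts as a boundary, which is the kind of undesired output to expect without more careful heuristics.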

I'm not sure I'd accept such a change in this repo though. It sounds pretty hard to manage -- in particular identifying sentence boundaries reliably is likely to be problematic, so this could easily produce undesired output, and if you want sentences the next user will want paragraphs, etc. I think perhaps you should be willing to do a bit more work when annotating the document in the first place :)

0xabu avatar Aug 30 '23 10:08 0xabu

Thanks for your quick reply. I understand what you mean and could implement it as you describe.

As a Ph.D. candidate, my main task involves reading and annotating literature. Your tool has been helpful in exporting my annotations in a specific format, which has significantly aided my work. However, I also use annotations to mark well-used words and phrases, and I would like to export these along with their context (i.e., the whole sentence). This would help me better understand the meaning and usage of each phrase when I review my notes.

I have also looked at the pdfminer module. Its documentation says that to extract all of the text, we can do:

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer
for page_layout in extract_pages("test.pdf"):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            print(element.get_text())

How can I match an annotation to one of these elements? Is this possible?

Shadowalker1995 avatar Aug 30 '23 14:08 Shadowalker1995

That's basically the problem at the core of pdfannots :) Most page elements have x/y coordinates, and each annotation consists of one or more bounding boxes, so the problem mostly boils down to processing the text elements and then checking for intersections between them and the annotation boxes. However, you can't just use LTTextContainers for this as those are too large (e.g. entire boxes or lines), rather you have to look at the characters inside them. The logic for this is in _PDFProcessor.render and its helpers like test_boxes.
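As a rough illustration of the intersection idea, here is a minimal, self-contained sketch. It is not the code in _PDFProcessor.render or test_boxes: the `Char` class is a hypothetical stand-in for a character with x0/y0/x1/y1 coordinates (pdfminer's LTChar exposes attributes with those names), and the "character centre inside an annotation box" rule is an assumption made for this example.

```python
from dataclasses import dataclass
from typing import Iterable, Sequence, Tuple

# An annotation box as (x0, y0, x1, y1) in page coordinates.
Box = Tuple[float, float, float, float]


@dataclass
class Char:
    """Hypothetical stand-in for a single laid-out character."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def centre_in_box(ch: Char, box: Box) -> bool:
    """True if the centre point of the character lies inside the box."""
    cx = (ch.x0 + ch.x1) / 2
    cy = (ch.y0 + ch.y1) / 2
    x0, y0, x1, y1 = box
    return x0 <= cx <= x1 and y0 <= cy <= y1


def text_under_annotation(chars: Iterable[Char], boxes: Sequence[Box]) -> str:
    """Join, in reading order, the characters covered by any annotation box."""
    return ''.join(
        ch.text for ch in chars
        if any(centre_in_box(ch, box) for box in boxes)
    )


if __name__ == "__main__":
    # Five characters, each 10 units wide, on one line.
    line = [Char(c, i * 10, 0, (i + 1) * 10, 10) for i, c in enumerate("abcde")]
    print(text_under_annotation(line, [(10, 0, 30, 10)]))
```

Working per character rather than per LTTextContainer is what makes it possible to pick out only the annotated words, and the characters just before and after the matched run are where any surrounding context would come from.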

0xabu avatar Aug 30 '23 14:08 0xabu

I will give it a try. Thank you for your kind guidance.

Shadowalker1995 avatar Sep 01 '23 07:09 Shadowalker1995

I have finished an initial implementation. Here is my repo.

Shadowalker1995 avatar Sep 04 '23 15:09 Shadowalker1995