pdfannots
Feature: Outputting an annotation and the entire sentence where the annotation is located
Thank you so much for developing this module, it's fantastic. Is it possible to implement the function of simultaneously outputting an annotation and the entire sentence where the annotation is located?
If it is possible, could you please explain the general principle? Thanks.
Capturing text before/after an annotation is implemented in the code as "context", but is currently used only for strikeout annotations. My expectation was that anyone adding a comment on a specific sentence would use highlight annotations, where the highlight covers the text you want to include with the annotation.
If you did want to include context with other annotation types, you could probably modify Annotation.wants_context to capture context for those types where you need it, then implement some heuristics for deciding where the sentence boundaries lie -- the current algorithm for this is implemented by trim_context in the markdown printer.
I'm not sure I'd accept such a change in this repo though. It sounds pretty hard to manage -- in particular identifying sentence boundaries reliably is likely to be problematic, so this could easily produce undesired output, and if you want sentences the next user will want paragraphs, etc. I think perhaps you should be willing to do a bit more work when annotating the document in the first place :)
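To make the difficulty concrete, here is a minimal sketch of a sentence-boundary heuristic of the kind discussed above. It is not the trim_context implementation from pdfannots; the function name expand_to_sentence and its three arguments (text before the annotation, the annotated text, text after it) are assumptions made for illustration only.

```python
import re

# A sentence terminator, optionally followed by closing quotes or brackets.
# Hypothetical heuristic: it will mis-split on abbreviations such as "e.g. ".
_END_THEN_SPACE = re.compile(r"""[.!?]["')\]]*\s+""")
_END_OF_SENTENCE = re.compile(r"""[.!?]["')\]]*(?=\s|$)""")


def expand_to_sentence(before: str, selected: str, after: str) -> str:
    """Return the sentence that contains `selected`, using naive punctuation rules."""
    # The sentence starts just after the last terminator found in `before`.
    start = 0
    for match in _END_THEN_SPACE.finditer(before):
        start = match.end()

    # The sentence ends at the first terminator found in `after`,
    # or at the end of the available context if there is none.
    match = _END_OF_SENTENCE.search(after)
    end = match.end() if match else len(after)

    return (before[start:] + selected + after[:end]).strip()


sentence = expand_to_sentence(
    "First sentence. This tool is ", "fantastic", " for notes. Another one."
)
print(sentence)
```

Even this small version shows why the result is fragile: abbreviations, decimal numbers, and text that PDF extraction splits across columns or pages would all need extra special cases.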
Thanks for your quick reply. I understand what you mean, and I could implement the code as you described.
As a Ph.D. candidate, my main task involves reading and annotating literature. Your tool has been helpful in exporting my annotations in a specific format, which has significantly aided my work. However, I also use annotations for another purpose: marking well-used words and phrases. I would like to export these annotations along with their context (i.e., the whole sentence). This would help me better understand the meaning and usage of each phrase when I review my notes.
I have also checked the pdfminer module. Its documentation says that to extract all of the text, we could do:
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_layout in extract_pages("test.pdf"):
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                print(element.get_text())
How can I use the annot to match an element? Is this possible?
That's basically the problem at the core of pdfannots :) Most page elements have x/y coordinates, and each annotation consists of one or more bounding boxes, so the problem mostly boils down to processing the text elements and then checking for intersections between them and the annotation boxes. However, you can't just use LTTextContainers for this, as those are too large (e.g. entire boxes or lines); rather, you have to look at the characters inside them. The logic for this is in _PDFProcessor.render and its helpers like test_boxes.
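As a rough illustration of the intersection idea described above, the sketch below walks down to individual pdfminer LTChar objects and keeps the characters whose bounding box overlaps an annotation box. It is not the logic of _PDFProcessor.render or test_boxes; the helper names (iter_page_chars, overlaps, text_under_boxes) and the 50% overlap threshold are assumptions chosen for the example, and boxes are taken to be (x0, y0, x1, y1) tuples in page coordinates.

```python
def iter_page_chars(pdf_path):
    """Yield (page_number, character, bbox) for every LTChar in the PDF."""
    # Imported here so the geometry helpers below work without pdfminer installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    for page_no, page_layout in enumerate(extract_pages(pdf_path)):
        for element in page_layout:
            if not isinstance(element, LTTextContainer):
                continue
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                for obj in line:
                    if isinstance(obj, LTChar):
                        yield page_no, obj.get_text(), obj.bbox


def overlaps(char_box, annot_box, min_fraction=0.5):
    """True if at least `min_fraction` of the character box lies inside the annotation box."""
    cx0, cy0, cx1, cy1 = char_box
    ax0, ay0, ax1, ay1 = annot_box
    width = min(cx1, ax1) - max(cx0, ax0)
    height = min(cy1, ay1) - max(cy0, ay0)
    if width <= 0 or height <= 0:
        return False  # the boxes do not intersect at all
    char_area = (cx1 - cx0) * (cy1 - cy0)
    return char_area > 0 and (width * height) / char_area >= min_fraction


def text_under_boxes(chars, annot_boxes):
    """Join the characters (in reading order) that fall under any annotation box."""
    return "".join(
        text for text, box in chars
        if any(overlaps(box, annot) for annot in annot_boxes)
    )


# Three 10x10 characters in a row; the annotation box mostly covers only "b".
sample_chars = [("a", (0, 0, 10, 10)), ("b", (10, 0, 20, 10)), ("c", (20, 0, 30, 10))]
print(text_under_boxes(sample_chars, [(9, 0, 21, 10)]))
```

To use it on a real file, the characters of one page from iter_page_chars would be passed to text_under_boxes together with that page's annotation rectangles. Requiring a minimum overlap fraction, rather than any intersection at all, is one way to avoid picking up neighbouring characters that a highlight only grazes.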
I will give it a try. Thank you for your kind guidance.
I have finished an initial implementation. Here is my repo.