PDF example of truncated highlight

Open tadeoos opened this issue 4 years ago • 2 comments

Hey all, thank you for this fantastic script! It works very well, but I found a PDF (attached) whose highlights are severely truncated. I tweaked the boxhit function to return True if there is any overlap at all, which gave me better results, but the script still does not pick up the last line of each highlight. It looks like the original boxes and the rectangle in the Annotation object really are missing this last line (the annotation's y0 is bigger than the item's)...

Anyway... I can provide more info if you'd like, and I'd very much appreciate any insight into fixing this, although it may be more of a pdfminer issue...

pwc-tax-guide.pdf

tadeoos avatar Jul 29 '20 17:07 tadeoos

Thanks for the report and sample PDF. I've futzed with the hit detection algorithm quite a few times before, but haven't had any reports of issues with it for a long time, so I suspect this may be an issue with the PDF annotation software as much as it is with pdfminer. I could consider making the 0.5 constant tunable, but it sounds like that wouldn't have fully solved your issue (?)
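For readers following along, here is a minimal sketch of what a tunable area-overlap hit test could look like. It is an illustration only, not the actual pdfannots code: the names Box, overlap_fraction, box_hit and min_fraction are hypothetical, and the coordinates are made up. It assumes a text item counts as highlighted when at least min_fraction of its area lies inside the annotation rectangle, with 0 meaning "any overlap at all".

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned rectangle in PDF coordinates (hypothetical helper type)."""
    x0: float
    y0: float
    x1: float
    y1: float


def overlap_fraction(item: Box, box: Box) -> float:
    """Return the fraction of the item's area that lies inside box (0.0 to 1.0)."""
    x_overlap = max(0.0, min(item.x1, box.x1) - max(item.x0, box.x0))
    y_overlap = max(0.0, min(item.y1, box.y1) - max(item.y0, box.y0))
    item_area = (item.x1 - item.x0) * (item.y1 - item.y0)
    if item_area <= 0:
        return 0.0  # degenerate items never count as hit
    return (x_overlap * y_overlap) / item_area


def box_hit(item: Box, box: Box, min_fraction: float = 0.5) -> bool:
    """Hit test with a tunable threshold; min_fraction <= 0 means any overlap."""
    fraction = overlap_fraction(item, box)
    if min_fraction <= 0:
        return fraction > 0
    return fraction >= min_fraction


# Made-up geometry: an annotation rectangle and three character boxes.
annotation = Box(0, 10, 100, 30)
inside = Box(10, 12, 20, 28)    # fully covered by the annotation
straddling = Box(10, 4, 20, 14)  # 40% of its area is covered
below = Box(10, 0, 20, 9)        # entirely below the annotation's y0

for name, item in [("inside", inside), ("straddling", straddling), ("below", below)]:
    print(name, box_hit(item, annotation), box_hit(item, annotation, min_fraction=0))
```

Under these assumptions, lowering the threshold recovers the straddling box, but no threshold can recover the box that lies entirely below the annotation's y0, which is consistent with the last line still being missed.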

0xabu avatar Jul 30 '20 04:07 0xabu

Thanks, @0xabu, for the quick reply! Even with the 0.5 constant lowered to 0, I'm still missing some of the letters...

I dug a bit deeper and I believe it is a pdfminer.six issue. I commented on an existing issue there. Just in case, I'm leaving a link here: https://github.com/pdfminer/pdfminer.six/issues/281#issuecomment-670097274

tadeoos avatar Aug 11 '20 15:08 tadeoos