pdfannots
PDF example of truncated highlight
Hey all,
Thank you all for this fantastic script! It works very well, although I found a PDF (attached) whose highlights are being severely truncated. I tweaked the boxhit function to return True if there is any overlap at all, which gave me better results, but the script still does not pick up the last line of each highlight. It looks like the original boxes and the rectangle in the Annotation object are indeed missing this last line (the annotation's y0 is larger than the item's)...
Anyway... I can provide more info if you'd like, and I'd very much appreciate any insight into fixing this, although it is also possible that this is more of a pdfminer issue...
Thanks for the report and sample PDF. I've futzed with the hit detection algorithm quite a few times before, but haven't had any reports of issues with it for a long time so I suspect this may be an issue with the PDF annotation software as much as it is with pdfminer. I could consider making the 0.5 constant tunable, but that sounds like it wouldn't have fully solved your issue (?)
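For readers following along, here is a minimal sketch of an area-overlap hit test with a tunable threshold. It is not the script's actual code: the name `box_hit`, the `min_overlap` parameter, and the plain `(x0, y0, x1, y1)` tuples are assumptions made for illustration, and the real function may differ in its details.

```python
from typing import Tuple

# (x0, y0, x1, y1) with x0 <= x1 and y0 <= y1, in PDF page coordinates
Box = Tuple[float, float, float, float]


def box_hit(item: Box, box: Box, min_overlap: float = 0.5) -> bool:
    """Return True if at least `min_overlap` of the item's area lies inside `box`.

    Hypothetical helper: `min_overlap` stands in for the hard-coded 0.5 constant.
    """
    ix0, iy0, ix1, iy1 = item
    bx0, by0, bx1, by1 = box

    # Width and height of the intersection rectangle (zero if they do not intersect).
    x_overlap = max(0.0, min(ix1, bx1) - max(ix0, bx0))
    y_overlap = max(0.0, min(iy1, by1) - max(iy0, by0))
    overlap_area = x_overlap * y_overlap
    item_area = (ix1 - ix0) * (iy1 - iy0)

    if item_area <= 0:
        return False
    # Require a real intersection so that min_overlap=0 means "any overlap at all".
    return overlap_area > 0 and overlap_area >= min_overlap * item_area


glyph = (0, 0, 10, 10)        # a text item
highlight = (0, 6, 10, 20)    # annotation whose bottom edge (y0) is above the item's

print(box_hit(glyph, highlight))                   # False: only 40% of the item is covered
print(box_hit(glyph, highlight, min_overlap=0.0))  # True: any overlap counts
```

Lowering the threshold only helps when the annotation rectangle touches the text at all. If the rectangle stops short of the last line entirely, no threshold value will recover it.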
Thanks, @0xabu, for the quick reply! Even with the 0.5 constant lowered to 0, I'm still missing some of the letters...
I dug a bit deeper and I believe it is a pdfminer.six issue. I filed an issue there (or rather, commented on an existing one). Just in case, I'm leaving a link here: https://github.com/pdfminer/pdfminer.six/issues/281#issuecomment-670097274