
hyperlinks have negative height

Open • bentsi opened this issue 2 years ago • 1 comment

Describe the bug

The height property of a hyperlink has a negative value.

Code to reproduce the problem

  1. Open the PDF file linked below with pdfplumber.
  2. Inspect pdf_file.pages[61].hyperlinks.
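
To make the check in step 2 concrete, here is a minimal sketch that filters a list of hyperlink dicts for negative heights. The helper name negative_height_links and the sample data are hypothetical: the first entry uses the Rect values quoted later in this thread, and the second is a made-up well-formed link for contrast.

```python
def negative_height_links(hyperlinks):
    """Return the hyperlink dicts whose reported height is below zero."""
    return [link for link in hyperlinks if link["height"] < 0]


# Illustrative stand-in for the dicts found in pdf_file.pages[61].hyperlinks.
sample_links = [
    {
        # Rect values quoted in this thread; height assumed to be y1 - y0.
        "x0": 428.053, "y0": 634.536, "x1": 453.041, "y1": 626.144,
        "height": 626.144 - 634.536,
    },
    {
        # Made-up, well-formed link for comparison.
        "x0": 100.0, "y0": 200.0, "x1": 150.0, "y1": 212.0,
        "height": 12.0,
    },
]

suspicious = negative_height_links(sample_links)
print(len(suspicious), "hyperlink(s) with negative height")
```

With pdfplumber installed and the PDF saved locally, the input would presumably come from pdfplumber.open(path).pages[61].hyperlinks instead of sample_links, where path is wherever the file was saved.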

PDF file

https://www.singtel.com/content/dam/singtel/about-us/sustainability/reports/Singtel-Group-Sustainability-Report-2022.pdf

Expected behavior

height should be a positive number

Actual behavior

height has a negative value

Screenshots

image

Environment

  • pdfplumber version: 0.7.5
  • Python version: 3.10.5
  • OS: Ubuntu 20.04.5 LTS (Focal Fossa)

Additional context

In addition, the "top" and "bottom" attributes are swapped, which doesn't comply with pdfplumber's bounding box definitions as discussed in https://github.com/jsvine/pdfplumber/issues/198

bentsi, Mar 22 '23 17:03

Hi @bentsi, thanks for sharing this example. The height, top, and bottom attributes are all calculated from the raw annotation's Rect (bounding box), specified by the PDF in a direct command.

In this particular PDF (as observed by opening it in a text editor), that Rect command is Rect[428.053 634.536 453.041 626.144], which corresponds to exactly what you see for x0, y0, x1, y1 in your screenshot above, suggesting that pdfplumber is collecting the correct information.
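
As a rough illustration of why this Rect yields the reported values, the sketch below derives the box attributes from the four numbers. It assumes height is computed as y1 - y0 and that top and bottom are the page height minus y1 and y0 respectively; the function name derive_box and the page height of 841.89 are hypothetical, not taken from the PDF or from pdfplumber's source.

```python
# Rect values quoted in this thread, read as (x0, y0, x1, y1).
RECT = (428.053, 634.536, 453.041, 626.144)

# Hypothetical page height (A4 in points); the real page may differ.
PAGE_HEIGHT = 841.89


def derive_box(rect, page_height):
    """Derive box attributes from a raw Rect, under the assumptions above."""
    x0, y0, x1, y1 = rect
    return {
        "x0": x0,
        "y0": y0,
        "x1": x1,
        "y1": y1,
        "top": page_height - y1,     # distance from page top to y1
        "bottom": page_height - y0,  # distance from page top to y0
        "width": x1 - x0,
        "height": y1 - y0,
    }


box = derive_box(RECT, PAGE_HEIGHT)
print(box["height"] < 0)            # negative, because y1 < y0 in this Rect
print(box["top"] > box["bottom"])   # top and bottom come out swapped
```

Because y1 is smaller than y0 in this Rect, both symptoms from the report (negative height, swapped top and bottom) follow from the same cause.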

Given that, there would seem to be two main options:

  • Do nothing, on the principle that pdfplumber should focus on PDF objects' actual (i.e., as coded) attributes, rather than what we think the author intended.

  • When pdfplumber sees an annotation that uses a bounding box that suggests a negative height, "fix" the bounding box (probably by flipping the vertical coordinates) so that it has a positive height.
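
For reference, the second option could look roughly like the following user-side sketch, which flips the vertical coordinates only when the Rect implies a negative height. The helper normalize_rect is hypothetical and is not part of pdfplumber.

```python
def normalize_rect(rect):
    """Return (x0, y0, x1, y1) with y0 <= y1, swapping the vertical pair if needed."""
    x0, y0, x1, y1 = rect
    if y1 < y0:
        # Negative height: flip the vertical coordinates.
        y0, y1 = y1, y0
    return (x0, y0, x1, y1)


# Rect values quoted in this thread.
raw = (428.053, 634.536, 453.041, 626.144)
fixed = normalize_rect(raw)
print(fixed)
```

This sketch leaves the horizontal coordinates untouched; a fuller version might apply the same treatment to x0 and x1.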

My inclination is toward the first option, because trying to fix PDF creators' mistakes seems like opening a can of worms. But I'm open to suggestions otherwise.

jsvine, Mar 22 '23 22:03