pdfplumber TypeError: unsupported operand type(s) for -: 'float' and 'NoneType'

Open loganathanspr opened this issue 1 year ago • 1 comments

Describe the bug

I am extracting annotations from a PDF file. Accessing .annots raises the TypeError above. After I edited each annotation manually (just adding or deleting one extra character), the error no longer occurred. I suspect the original text encoding of the annotations differs from the one pdfplumber expects. Does pdfplumber make any strict assumptions about text encoding?

Code to reproduce the problem

import pdfplumber


def get_pdf_annotations(pdf_path: str):
  """Get all annotations (by page) for a pdf file.

  Args:
    pdf_path (str): Path to pdf file.

  Returns:
    List of annotations: List index corresponds to page numbers (starting from 0)
    and each list item is a list of annotations found for that page.
  """
  annots_all_pages = []
  with pdfplumber.open(pdf_path) as pdf:
    pages = pdf.pages
    for p in pages:
      page_annots = []
      texts = []
      colors = []
      # The TypeError is raised here, when .annots is accessed.
      annotations = p.annots
      # ...
      # ....
  return annots_all_pages

Environment

pdfplumber version: 0.7.4
pdfminer.six version: 20220524
Python version: 3.8, 3.9
OS: Mac, Linux

Sep 07 '22 12:09 loganathanspr
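One way to narrow down which page triggers the error is to wrap the `.annots` access in a `try`/`except`. The following is only a sketch: `collect_annots_by_page` is a hypothetical helper name, and it assumes nothing about the page objects beyond an `.annots` attribute, as used in the code above.

```python
from typing import Any, Dict, Iterable, List, Tuple


def collect_annots_by_page(
    pages: Iterable[Any],
) -> Tuple[List[List[Dict[str, Any]]], List[Tuple[int, str]]]:
    """Collect annotations per page, recording pages whose .annots raises.

    Hypothetical helper: `pages` is any iterable of objects exposing an
    `.annots` attribute (for example, the `pages` list of an open PDF).
    """
    annots_all_pages = []
    failures = []  # (zero-based page index, error message)
    for index, page in enumerate(pages):
        try:
            page_annots = list(page.annots)
        except TypeError as exc:
            # Keep going so every failing page is reported, not just the first.
            failures.append((index, str(exc)))
            page_annots = []
        annots_all_pages.append(page_annots)
    return annots_all_pages, failures
```

It could be called as `collect_annots_by_page(pdf.pages)` inside the `with pdfplumber.open(pdf_path) as pdf:` block; the `failures` list then shows which page indices to inspect more closely.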

Thanks for flagging @loganathanspr! Looking at the stacktrace, my best guess is that the annotation has an undefined bounding box. (Hence why it'd get such an error on line 167, where the stacktrace is pointing.) But it's a bit difficult to know for sure, or to test a fix, without seeing the actual PDF. Are you able to share that?

Sep 07 '22 12:09 jsvine
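If the guess above is right and the annotation's bounding box is undefined, the error would come from subtracting a missing coordinate from the page height. Below is a minimal sketch of that failure mode and of a guard against it. The names `has_valid_rect` and `annot_top` are hypothetical and not part of pdfplumber, and the layout `[x0, y0, x1, y1]` for the bounding box is an assumption.

```python
from typing import Any, Optional


def has_valid_rect(rect: Any) -> bool:
    """Return True if rect is a 4-item list/tuple of numbers (no None)."""
    return (
        isinstance(rect, (list, tuple))
        and len(rect) == 4
        and all(
            isinstance(v, (int, float)) and not isinstance(v, bool)
            for v in rect
        )
    )


def annot_top(page_height: float, rect: Any) -> Optional[float]:
    """Distance from the top of the page, or None if the rect is unusable.

    Assumes rect is [x0, y0, x1, y1] with y measured from the page bottom.
    """
    if not has_valid_rect(rect):
        return None
    return page_height - rect[3]


# Reproduce the error message from the issue title with plain Python values.
try:
    792.0 - None
except TypeError as exc:
    message = str(exc)
```

With a check like this, an annotation whose bounding box contains `None` could be skipped or reported instead of raising.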
