pdfplumber
TypeError: unsupported operand type(s) for -: 'float' and 'NoneType'
Describe the bug
I am extracting annotations from a PDF file. Accessing .annots on a page raises the TypeError above. When I updated each annotation manually (just adding or deleting one extra character), the error went away. I suspect the original text encoding of the annotations differs from the one pdfplumber expects.
Does pdfplumber make any strict assumptions about text encoding?
Code to reproduce the problem
import pdfplumber


def get_pdf_annotations(pdf_path: str):
    """Get all annotations (by page) for a pdf file.

    Args:
        pdf_path (str): Path to pdf file.

    Returns:
        List of annotations: List index corresponds to page numbers (starting from 0)
        and each list item is a list of annotations found for that page.
    """
    annots_all_pages = []
    with pdfplumber.open(pdf_path) as pdf:
        pages = pdf.pages
        for p in pages:
            page_annots = []
            texts = []
            colors = []
            annotations = p.annots  # the TypeError is raised here
            # ...
            # ....
    return annots_all_pages
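To narrow down which pages trigger the error, a small helper along these lines might be useful. This is only a sketch: `find_failing_pages` is a hypothetical name, not part of pdfplumber, and it assumes only that each page object exposes `.annots` and `.page_number`, as pdfplumber pages do.

```python
def find_failing_pages(pages):
    """Return (page_number, error message) for each page whose .annots raises TypeError."""
    failures = []
    for page in pages:
        try:
            # Accessing the attribute is what triggers annotation parsing.
            _ = page.annots
        except TypeError as exc:
            failures.append((page.page_number, str(exc)))
    return failures


# Possible usage with a real file (not run here):
# import pdfplumber
# with pdfplumber.open(pdf_path) as pdf:
#     print(find_failing_pages(pdf.pages))
```

Knowing the failing page numbers should make it easier to cut the PDF down to a small, shareable sample.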
Screenshots
![Screenshot 2022-09-07 at 14 16 50](https://user-images.githubusercontent.com/1277579/188878626-3c8af682-ac8b-4178-ab08-c868b94d1cd0.png)
Environment
- pdfplumber version: 0.7.4
- pdfminer.six version: 20220524
- Python version: 3.8, 3.9
- OS: Mac, Linux
Thanks for flagging @loganathanspr! Looking at the stacktrace, my best guess is that the annotation has an undefined bounding box. (Hence why it'd get such an error on line 167, where the stacktrace is pointing.) But it's a bit difficult to know for sure, or to test a fix, without seeing the actual PDF. Are you able to share that?
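For illustration, here is a minimal sketch of how an undefined bounding-box coordinate could produce exactly this message. It is not pdfplumber's actual code from line 167. `to_top_based` is a hypothetical helper, and it assumes the conversion subtracts the annotation's y-coordinates from the page height to get top-based coordinates.

```python
def to_top_based(page_height, rect):
    """Convert a bottom-left-origin (x0, y0, x1, y1) box to top-based coordinates."""
    x0, y0, x1, y1 = rect
    return {
        "x0": x0,
        "top": page_height - y1,  # fails if y1 is None
        "x1": x1,
        "bottom": page_height - y0,
    }


# A well-formed box converts without trouble.
print(to_top_based(792.0, (10.0, 700.0, 110.0, 720.0)))

# A box with a missing coordinate raises the TypeError from the report.
try:
    to_top_based(792.0, (10.0, 700.0, 110.0, None))
except TypeError as exc:
    print(exc)
```

If this is the cause, the encoding of the annotation text would be unrelated. Re-saving the annotation in a PDF editor may simply rewrite its bounding box.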