pdfminer.six
Getting KeyError: 'N' in extract_pages when pdf file contains Note.
- PDFMiner fails when the PDF contains notes. The logs are below:
  File "/workspace/treeClassifier.py", line 56, in extract_features
    for page_number, page_layout in enumerate(extract_pages(pdf_file_obj)):
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/pdfminer/high_level.py", line 211, in extract_pages
    interpreter.process_page(page)
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/pdfminer/pdfinterp.py", line 997, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/pdfminer/pdfinterp.py", line 1014, in render_contents
    self.init_resources(resources)
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/pdfminer/pdfinterp.py", line 387, in init_resources
    colorspace = get_colorspace(resolve1(spec))
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/pdfminer/pdfinterp.py", line 370, in get_colorspace
    return PDFColorSpace(name, stream_value(spec[1])["N"])
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/pdfminer/pdftypes.py", line 285, in __getitem__
    return self.attrs[name]
KeyError: 'N'
Can you share the code / command and the PDF you are using?
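This isn't related to the notes; the PDF is invalid or corrupted. An ICCBased colour space is required to have the form [/ICCBased stream], and the stream must contain an N entry (the number of colour components) in its stream dictionary. Here the N entry is missing, which is why the lookup in get_colorspace raises KeyError: 'N'.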
Given that colour spaces are not really used by pdfminer anyway, we could probably just catch the exception in this case.
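To make the idea concrete, here is a minimal, self-contained sketch of what "catch the exception" could look like. It is not pdfminer's actual code: `FakeStream` and `iccbased_ncomponents` are hypothetical names that only mimic the failing `spec[1]["N"]` lookup from the traceback, and the fallback of 3 components is an assumption chosen for illustration.

```python
# Hypothetical stand-in for a PDF stream object; only the attribute
# dictionary matters for this illustration.
class FakeStream:
    def __init__(self, attrs):
        self.attrs = attrs

    def __getitem__(self, name):
        # Raises KeyError when the entry is missing, like the traceback above.
        return self.attrs[name]


def iccbased_ncomponents(spec, fallback=3):
    """Return the N entry of an [/ICCBased stream] spec, or `fallback`.

    `fallback=3` is an assumed default, not something pdfminer defines.
    """
    try:
        return spec[1]["N"]
    except (KeyError, IndexError, TypeError):
        # Malformed colour space: the stream is absent, or its
        # dictionary has no N entry. Fall back instead of crashing.
        return fallback


valid = ["ICCBased", FakeStream({"N": 4})]
broken = ["ICCBased", FakeStream({})]  # N is missing, as in the reported PDF

print(iccbased_ncomponents(valid))
print(iccbased_ncomponents(broken))
```

Whether a silent fallback or a logged warning is preferable is a design choice; the point is only that a missing N does not need to abort page extraction.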