py-pdf-parser
Unable to create a PDFDocument object
I have a file containing several pages (see bugfile.pdf).
I want to extract pages 2 and 3 of this file and add them to a new PDFDocument
object. However, this does not seem to be possible. I tried the following:
from py_pdf_parser.loaders import load, load_file
from py_pdf_parser.components import PDFPage, PDFDocument
file = "bugfile.pdf"
document = load_file(file)
pages = document.pages[1:3]
new_page_dict = {1:pages[0], 2:pages[1]}
new_document = PDFDocument(pages=pages)
The error I get is AttributeError: 'PDFElement' object has no attribute 'y0'
(line 31 in py_pdf_parser.components), which makes sense because the ElementOrdering expects an LTPage
object given by pdfminer. Even when submitting a custom sorting function via:
sorting = lambda elements: sorted(elements, key=lambda elem: (-elem.original_element.y0, elem.original_element.x0))
new_document = PDFDocument(pages=pages, element_ordering=sorting)
I get an AttributeError: 'PDFElement' object has no attribute 'x0'
(line 147 in py_pdf_parser.components) when the bounding boxes are calculated, which I can no longer fix from my own code.
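To make the intent of the custom ordering concrete, here is a minimal stand-alone sketch of what the sort key does. FakeLayoutItem and FakeElement are hypothetical stand-ins that I made up for illustration; they are not classes from py_pdf_parser or pdfminer, and I am assuming only that the wrapper exposes coordinates through original_element and not directly.

```python
from dataclasses import dataclass


@dataclass
class FakeLayoutItem:
    """Hypothetical stand-in for a pdfminer layout object with coordinates."""
    x0: float
    y0: float


@dataclass
class FakeElement:
    """Hypothetical stand-in for a wrapper that only exposes original_element."""
    original_element: FakeLayoutItem


def top_to_bottom_left_to_right(elements):
    # In PDF coordinates a larger y0 is nearer the top of the page,
    # so y0 is negated to sort top-to-bottom, then x0 sorts left-to-right.
    return sorted(
        elements,
        key=lambda elem: (-elem.original_element.y0, elem.original_element.x0),
    )


elements = [
    FakeElement(FakeLayoutItem(x0=50, y0=100)),
    FakeElement(FakeLayoutItem(x0=10, y0=700)),
    FakeElement(FakeLayoutItem(x0=10, y0=100)),
]
ordered = top_to_bottom_left_to_right(elements)
coords = [(e.original_element.x0, e.original_element.y0) for e in ordered]
```

The sort itself works on the wrapped elements; any later code that reads x0 or y0 directly from the wrapper would still fail, which matches the second error above.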
When submitting an LTPage
object given by pdfminer instead, I get the error AttributeError: 'LTPage' object has no attribute 'elements'
(line 413 in py_pdf_parser.components) because this line expects a PDFElement.
It seems that the constructor of PDFDocument cannot be called directly, or at least not in the way I have tried so far. Does anybody have any ideas?