py-pdf-parser

Unable to create a PDFDocument object

Open papstchaka opened this issue 1 year ago • 2 comments

I have a file containing several pages (see bugfile.pdf). I want to extract pages 2 and 3 of this file and add them to a new PDFDocument object. However, this does not seem to be possible. I tried the following:

from py_pdf_parser.loaders import load_file
from py_pdf_parser.components import PDFDocument

file = "bugfile.pdf"
document = load_file(file)
# document.pages is zero-indexed, so this slice selects pages 2 and 3
pages = document.pages[1:3]
# Map new page numbers to the extracted pages
new_page_dict = {1: pages[0], 2: pages[1]}
new_document = PDFDocument(pages=new_page_dict)

The error I get is AttributeError: 'PDFElement' object has no attribute 'y0' (line 31 in py_pdf_parser.components), which makes sense because the ElementOrdering expects an LTPage object from pdfminer. Even when I pass a custom sorting function via:

sorting = lambda elements: sorted(elements, key=lambda elem: (-elem.original_element.y0, elem.original_element.x0))
new_document = PDFDocument(pages=new_page_dict, element_ordering=sorting)

I get an AttributeError: 'PDFElement' object has no attribute 'x0' (line 147 in py_pdf_parser.components) when the bounding boxes are calculated, which I can no longer fix from my own code. When I pass an LTPage object from pdfminer instead, I get AttributeError: 'LTPage' object has no attribute 'elements' (line 413 in py_pdf_parser.components), because that line expects a PDFElement.

It seems that the PDFDocument constructor cannot be called directly, or at least not in the way I have tried so far. Does anybody have any ideas?
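
The errors above suggest that the constructor wants raw pdfminer layout objects, not already-wrapped PDFElement objects. One idea would be to unwrap each element again before building the new document. What follows is only a minimal, untested sketch of that idea. The `Page` tuple is a local stand-in for whatever container the constructor really expects, and `unwrap_pages` is a hypothetical helper, not part of py-pdf-parser. It assumes that each page exposes `width`, `height` and `elements`, and that each element exposes its pdfminer object as `original_element` (the attribute used in the custom sorting function above).

```python
from typing import Any, Dict, Iterable, List, NamedTuple


class Page(NamedTuple):
    # Local stand-in (assumption): a page described by its size and a list
    # of raw pdfminer layout elements, i.e. objects that carry x0/y0/x1/y1.
    width: int
    height: int
    elements: List[Any]


def unwrap_pages(pdf_pages: Iterable[Any], start: int = 1) -> Dict[int, Page]:
    """Rebuild a {page_number: Page} dict from already-parsed pages.

    Hypothetical helper. Assumes each page has .width, .height and
    .elements, and each element has .original_element.
    """
    return {
        number: Page(
            width=page.width,
            height=page.height,
            # Swap every wrapped element for the raw layout object it wraps
            elements=[element.original_element for element in page.elements],
        )
        for number, page in enumerate(pdf_pages, start=start)
    }
```

If these assumptions hold, the result of `unwrap_pages(document.pages[1:3])` might be passed as `pages=` to PDFDocument, ideally using the library's own page container in place of the stand-in. I have not verified that this works.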

papstchaka, May 11 '23 17:05