pdfminer.six
Accessing PDFPage.cropbox from LTPage
I have a PDF from which I want to extract some text:
One characteristic of this PDF is that it has a lot of text outside the page boundary. I usually handle this kind of case by checking the coordinates of every item against the boundaries of the corresponding LTPage. (Just FYI, I'm doing that in a custom TextConverter.)
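A minimal sketch of that kind of boundary check, assuming boxes are `(x0, y0, x1, y1)` tuples in the same order as the `.bbox` attribute of pdfminer.six layout objects. The names `is_inside` and `tol` are illustrative and not part of pdfminer.six.

```python
from typing import Tuple

# (x0, y0, x1, y1), the same ordering as the .bbox of pdfminer.six layout objects
BBox = Tuple[float, float, float, float]


def is_inside(item_bbox: BBox, page_bbox: BBox, tol: float = 0.0) -> bool:
    """Return True if item_bbox lies entirely within page_bbox.

    tol (hypothetical parameter) loosens the check by that many points on
    every side, which helps with glyphs that slightly overshoot the edge.
    """
    ix0, iy0, ix1, iy1 = item_bbox
    px0, py0, px1, py1 = page_bbox
    return (
        ix0 >= px0 - tol
        and iy0 >= py0 - tol
        and ix1 <= px1 + tol
        and iy1 <= py1 + tol
    )
```

Inside a custom converter this would be called as something like `is_inside(item.bbox, ltpage.bbox)`, dropping any item for which it returns `False`.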
This PDF differs a little from the usual ones in that I want to limit the text extraction to the cropbox rather than the mediabox. Here's an extract from the PDF:
<< /BleedBox [ 0 0 842 1191 ] /Contents 5 0 R /CropBox [ 40 8 245 1182 ] /MediaBox [ 0 0 842 1191 ] /Parent 3 0 R /Resources 6 0 R /Rotate 90 /StructParents 2 /Type /Page >>
It seems that LTPage's bbox is derived from the mediabox (as seen here in the code), and the cropbox seems to be inaccessible from it. Would it be possible to add .bleedbox, .cropbox and .mediabox to LTPage, or maybe a reference to the PDFPage?
I can provide a PR, but I need a bit of guidance on the best solution.
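Until something like that exists, one possible workaround is to read the boxes from `PDFPage` directly and map the crop box into the page's layout coordinates by hand. The sketch below is only an illustration under stated assumptions: `transform_box` and `iter_page_boxes` are hypothetical helper names, and the transformation matrix has to come from the caller (for example, the `ctm` argument a device receives in `begin_page`).

```python
from typing import Iterator, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)
Matrix = Tuple[float, float, float, float, float, float]  # (a, b, c, d, e, f)


def transform_box(ctm: Matrix, box: BBox) -> BBox:
    """Apply a PDF-style affine matrix to a box and return its bounding box.

    A point (x, y) maps to (a*x + c*y + e, b*x + d*y + f). All four corners
    are transformed so that rotated boxes are normalized correctly.
    """
    a, b, c, d, e, f = ctm
    x0, y0, x1, y1 = box
    corners = [(x0, y0), (x0, y1), (x1, y0), (x1, y1)]
    pts = [(a * x + c * y + e, b * x + d * y + f) for x, y in corners]
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    return (min(xs), min(ys), max(xs), max(ys))


def iter_page_boxes(path: str) -> Iterator[Tuple[BBox, BBox, int]]:
    """Yield (mediabox, cropbox, rotate) for each page of the PDF at path."""
    # Imported lazily so transform_box stays usable without pdfminer.six.
    from pdfminer.pdfpage import PDFPage

    with open(path, "rb") as fh:
        for page in PDFPage.get_pages(fh):
            yield tuple(page.mediabox), tuple(page.cropbox), page.rotate
```

For the page shown above, and assuming a /Rotate 90 page is mapped with the matrix `(0, -1, 1, 0, -y0, x1)` built from the mediabox (worth checking against `PDFPageInterpreter.process_page` in the installed version), `transform_box((0, -1, 1, 0, 0, 842), (40, 8, 245, 1182))` gives `(8, 597, 1182, 802)`, which could then serve as the clipping box instead of the LTPage bbox.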