pdfminer.six

Accessing PDFPage.cropbox from LTPage

Open eroux opened this issue 2 years ago • 0 comments

I have a PDF from which I want to extract some text:

v7p6-uncompressed.pdf

One characteristic of this PDF is that it has a lot of text outside of the page boundary. I usually handle this kind of case by checking the coordinates of every item against the boundaries of the corresponding LTPage. (Just FYI, I'm doing that in a custom TextConverter)
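For illustration, a minimal sketch of that kind of boundary check. It assumes boxes are `(x0, y0, x1, y1)` tuples, which is the shape of the `bbox` attribute on pdfminer layout objects; the helper name `is_inside` and the `tol` parameter are hypothetical, not part of pdfminer.six.

```python
from typing import Tuple

# (x0, y0, x1, y1), the same shape as the bbox attribute on layout objects
BBox = Tuple[float, float, float, float]


def is_inside(item_bbox: BBox, page_bbox: BBox, tol: float = 0.0) -> bool:
    """Return True if item_bbox lies entirely within page_bbox.

    tol is a hypothetical slack (in points) for items that touch or
    slightly overshoot the boundary.
    """
    ix0, iy0, ix1, iy1 = item_bbox
    px0, py0, px1, py1 = page_bbox
    return (
        ix0 >= px0 - tol
        and iy0 >= py0 - tol
        and ix1 <= px1 + tol
        and iy1 <= py1 + tol
    )
```

In a custom converter this would be called with each item's `bbox` and the `bbox` of the current LTPage, skipping items for which it returns False.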

This PDF is a little different from the usual ones in that I want to limit the text extraction to the cropbox rather than the mediabox. Here's an extract from the PDF:

<< /BleedBox [ 0 0 842 1191 ] /Contents 5 0 R /CropBox [ 40 8 245 1182 ] /MediaBox [ 0 0 842 1191 ] /Parent 3 0 R /Resources 6 0 R /Rotate 90 /StructParents 2 /Type /Page >>

It seems that LTPage's bbox is the mediabox (as seen here in the code), and the cropbox seems to be inaccessible. Is it possible to add .bleedbox, .cropbox and .mediabox to LTPage, or maybe a pointer to the PDFPage?
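As a possible stopgap, the cropbox can be mapped into the LTPage coordinate space by hand. The sketch below is pure Python and rests on one assumption that should be checked against the installed pdfminer.six version: that the page is laid out through a matrix derived from /Rotate and the mediabox origin, so that the mediabox ends up at `(0, 0, width, height)`. The names `page_ctm`, `apply_pt` and `box_in_layout_space` are hypothetical helpers, not pdfminer.six API.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]
Matrix = Tuple[float, float, float, float, float, float]


def page_ctm(mediabox: BBox, rotate: int) -> Matrix:
    """Matrix (a, b, c, d, e, f) that rotates the page by /Rotate and moves
    the mediabox to the origin. ASSUMPTION: mirrors what the interpreter does."""
    x0, y0, x1, y1 = mediabox
    rotate %= 360
    if rotate == 90:
        return (0, -1, 1, 0, -y0, x1)
    if rotate == 180:
        return (-1, 0, 0, -1, x1, y1)
    if rotate == 270:
        return (0, 1, -1, 0, y1, -x0)
    return (1, 0, 0, 1, -x0, -y0)


def apply_pt(m: Matrix, pt: Tuple[float, float]) -> Tuple[float, float]:
    """Apply the affine matrix to a point: x' = a*x + c*y + e, y' = b*x + d*y + f."""
    a, b, c, d, e, f = m
    x, y = pt
    return (a * x + c * y + e, b * x + d * y + f)


def box_in_layout_space(box: BBox, mediabox: BBox, rotate: int) -> BBox:
    """Transform two opposite corners of box and re-normalise so x0 <= x1, y0 <= y1."""
    m = page_ctm(mediabox, rotate)
    xa, ya = apply_pt(m, (box[0], box[1]))
    xb, yb = apply_pt(m, (box[2], box[3]))
    return (min(xa, xb), min(ya, yb), max(xa, xb), max(ya, yb))


# Values taken from the page dictionary quoted above.
crop = box_in_layout_space((40, 8, 245, 1182), (0, 0, 842, 1191), 90)
```

The inputs correspond to the `cropbox`, `mediabox` and `rotate` attributes of a PDFPage. In a custom converter they could presumably be captured where the page object is still available (for example in an overridden `begin_page`, if the installed version passes the PDFPage there), then compared against item coordinates instead of the LTPage bbox.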

I can provide a PR, but I need a bit of guidance on the best solution.

eroux, Jul 18 '23 17:07