pdfminer.six
Layout objects (LT*) are not returned in order of occurrence when a PDF page contains images and figures
I have a document with images of various sizes placed inline between lines of text. I need to extract the text together with the images, with each image linked at the position where it occurs in the text.
Below is the order of LT objects I get when using PDFPageAggregator and get_result:
<LTTextBoxHorizontal(0) 45.800,226.466,246.761,240.446 'Logging Off – Temporary \n'>
<LTTextBoxHorizontal(1) 275.880,241.446,279.407,251.466 ' \n'>
<LTTextBoxHorizontal(2) 20.400,204.216,153.087,214.266 '1. In the menu bar, click \n'>
<LTTextBoxHorizontal(3) 173.320,204.246,180.507,214.266 '. \n'>
<LTTextBoxHorizontal(4) 20.400,189.096,252.357,199.146 '2. A confirmation message for Temporary or \n'>
<LTTextBoxHorizontal(5) 38.400,177.006,144.987,187.026 'Full Log off displays. \n'>
<LTTextBoxHorizontal(6) 20.400,161.776,133.767,171.826 '3. Click Temporary. \n'>
<LTTextBoxHorizontal(7) 20.400,146.596,248.267,156.646 '4. Log on to the same terminal to continue. \n'>
<LTTextBoxHorizontal(8) 74.180,117.606,218.321,131.586 'Logging Off – Full \n'>
<LTTextBoxHorizontal(9) 20.400,95.296,153.087,105.346 '1. In the menu bar, click \n'>
<LTTextBoxHorizontal(10) 173.320,95.326,180.507,105.346 '. \n'>
<LTTextBoxHorizontal(11) 20.400,80.156,252.357,90.206 '2. A confirmation message for Temporary or \n'>
<LTTextBoxHorizontal(12) 38.400,68.066,144.987,78.086 'Full Log off displays. \n'>
<LTTextBoxHorizontal(13) 20.400,52.856,93.187,62.906 '3. Click Full. \n'>
<LTTextBoxHorizontal(14) 20.400,37.676,159.927,47.726 '4. Log on to any terminal. \n'>
<LTTextBoxHorizontal(17) 269.220,17.027,276.601,24.047 '2 \n'>
<LTFigure(Image36) 218.850,243.510,275.473,273.860 matrix=[56.62,0.00,0.00,30.35, (218.85,243.51)]>
<LTImage(Image36) 218.850,243.510,275.473,273.860 (125, 67)>
<LTFigure(Image42) 153.050,206.350,173.292,220.500 matrix=[20.24,0.00,0.00,14.15, (153.05,206.35)]>
<LTImage(Image42) 153.050,206.350,173.292,220.500 (56, 39)>
<LTFigure(Image42) 153.050,97.468,173.292,111.618 matrix=[20.24,0.00,0.00,14.15, (153.05,97.47)]>
<LTImage(Image42) 153.050,97.468,173.292,111.618 (56, 39)>
I expect the image object below to appear adjacent to the 4th object, or as the 5th object, following the order of occurrence on the page:
<LTImage(Image42) 153.050,206.350,173.292,220.500 (56, 39)>
<LTTextBoxHorizontal(0) 45.800,226.466,246.761,240.446 'Logging Off – Temporary \n'>
<LTTextBoxHorizontal(1) 275.880,241.446,279.407,251.466 ' \n'>
<LTTextBoxHorizontal(2) 20.400,204.216,153.087,214.266 '1. In the menu bar, click \n'>
**<LTImage(Image42) 153.050,206.350,173.292,220.500 (56, 39)>**
<LTTextBoxHorizontal(3) 173.320,204.246,180.507,214.266 '. \n'>
I presume this is a bug. Is there an existing way to prevent the text box objects from being grouped ahead of the images, or to get the layout objects in their exact order of occurrence?
Hi @praveenkumar-ravising, could you share the code you are using to generate this order? Depending on the LAParams you are using, this could be the expected result.
Hi @pietermarsman, thanks for the response.
Please find below the code that produces the output above:
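from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def ParsePages(doc, images_folder):
    # layoutParse and jsonFormatData are my own helpers (not shown here).
    rsrcmgr = PDFResourceManager()
    laparams = LAParams(all_texts=False, detect_vertical=True)
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for (i, page) in enumerate(PDFPage.create_pages(doc)):
        text_content = []
        interpreter.process_page(page)
        # Layout tree of the page that was just processed
        layout = device.get_result()
        layoutParse(layout, i + 1, images_folder, text_content)
        extractedText = ''.join(list(filter(None, text_content)))
        jsonFormatData(extractedText, i + 1)
    return text_content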
Actual Result:
<LTTextBoxHorizontal(0) 45.800,226.466,246.761,240.446 'Logging Off – Temporary \n'>
<LTTextBoxHorizontal(1) 275.880,241.446,279.407,251.466 ' \n'>
<LTTextBoxHorizontal(2) 20.400,204.216,153.087,214.266 '1. In the menu bar, click \n'>
<LTTextBoxHorizontal(3) 173.320,204.246,180.507,214.266 '. \n'>
<LTTextBoxHorizontal(4) 20.400,189.096,252.357,199.146 '2. A confirmation message for Temporary or \n'>
<LTTextBoxHorizontal(5) 38.400,177.006,144.987,187.026 'Full Log off displays. \n'>
<LTTextBoxHorizontal(6) 20.400,161.776,133.767,171.826 '3. Click Temporary. \n'>
<LTTextBoxHorizontal(7) 20.400,146.596,248.267,156.646 '4. Log on to the same terminal to continue. \n'>
<LTTextBoxHorizontal(8) 74.180,117.606,218.321,131.586 'Logging Off – Full \n'>
<LTTextBoxHorizontal(9) 20.400,95.296,153.087,105.346 '1. In the menu bar, click \n'>
<LTTextBoxHorizontal(10) 173.320,95.326,180.507,105.346 '. \n'>
<LTTextBoxHorizontal(11) 20.400,80.156,252.357,90.206 '2. A confirmation message for Temporary or \n'>
<LTTextBoxHorizontal(12) 38.400,68.066,144.987,78.086 'Full Log off displays. \n'>
<LTTextBoxHorizontal(13) 20.400,52.856,93.187,62.906 '3. Click Full. \n'>
<LTTextBoxHorizontal(14) 20.400,37.676,159.927,47.726 '4. Log on to any terminal. \n'>
<LTTextBoxHorizontal(17) 269.220,17.027,276.601,24.047 '2 \n'>
<LTFigure(Image36) 218.850,243.510,275.473,273.860 matrix=[56.62,0.00,0.00,30.35, (218.85,243.51)]>
<LTImage(Image36) 218.850,243.510,275.473,273.860 (125, 67)>
<LTFigure(Image42) 153.050,206.350,173.292,220.500 matrix=[20.24,0.00,0.00,14.15, (153.05,206.35)]>
<LTImage(Image42) 153.050,206.350,173.292,220.500 (56, 39)>
<LTFigure(Image42) 153.050,97.468,173.292,111.618 matrix=[20.24,0.00,0.00,14.15, (153.05,97.47)]>
<LTImage(Image42) 153.050,97.468,173.292,111.618 (56, 39)>
Expectation from the document:
_<LTFigure(Image36) 218.850,243.510,275.473,273.860 matrix=[56.62,0.00,0.00,30.35, (218.85,243.51)]>
<LTImage(Image36) 218.850,243.510,275.473,273.860 (125, 67)>_
<LTTextBoxHorizontal(0) 45.800,226.466,246.761,240.446 'Logging Off – Temporary \n'>
<LTTextBoxHorizontal(1) 275.880,241.446,279.407,251.466 ' \n'>
<LTTextBoxHorizontal(2) 20.400,204.216,153.087,214.266 '1. In the menu bar, click \n'>
<LTTextBoxHorizontal(3) 173.320,204.246,180.507,214.266 '. \n'>
_<LTFigure(Image42) 153.050,206.350,173.292,220.500 matrix=[20.24,0.00,0.00,14.15, (153.05,206.35)]>
<LTImage(Image42) 153.050,206.350,173.292,220.500 (56, 39)>_
<LTTextBoxHorizontal(4) 20.400,189.096,252.357,199.146 '2. A confirmation message for Temporary or \n'>
<LTTextBoxHorizontal(5) 38.400,177.006,144.987,187.026 'Full Log off displays. \n'>
<LTTextBoxHorizontal(6) 20.400,161.776,133.767,171.826 '3. Click Temporary. \n'>
<LTTextBoxHorizontal(7) 20.400,146.596,248.267,156.646 '4. Log on to the same terminal to continue. \n'>
<LTTextBoxHorizontal(8) 74.180,117.606,218.321,131.586 'Logging Off – Full \n'>
<LTTextBoxHorizontal(9) 20.400,95.296,153.087,105.346 '1. In the menu bar, click \n'>
<LTTextBoxHorizontal(10) 173.320,95.326,180.507,105.346 '. \n'>
_<LTFigure(Image42) 153.050,97.468,173.292,111.618 matrix=[20.24,0.00,0.00,14.15, (153.05,97.47)]>
<LTImage(Image42) 153.050,97.468,173.292,111.618 (56, 39)>_
<LTTextBoxHorizontal(11) 20.400,80.156,252.357,90.206 '2. A confirmation message for Temporary or \n'>
<LTTextBoxHorizontal(12) 38.400,68.066,144.987,78.086 'Full Log off displays. \n'>
<LTTextBoxHorizontal(13) 20.400,52.856,93.187,62.906 '3. Click Full. \n'>
<LTTextBoxHorizontal(14) 20.400,37.676,159.927,47.726 '4. Log on to any terminal. \n'>
<LTTextBoxHorizontal(17) 269.220,17.027,276.601,24.047 '2 \n'>
Let me know if I am missing something in LAParams or in my code that would make the output match the order in the document.
Hi @praveenkumar-ravising, forgot to ask, can you also share the pdf?
Hi @pietermarsman,
I am attaching a sample single-page PDF that I created to demonstrate the problem. Below are the extracted layout objects, with the text and image order scrambled:
<LTTextBoxHorizontal(0) 306.050,679.780,308.545,690.820 ' \n'>
<LTTextBoxHorizontal(1) 540.100,679.780,542.595,690.820 ' \n'>
<LTTextBoxHorizontal(2) 77.400,646.900,229.705,657.940 'est the Image and Text Extraction \n'>
<LTTextBoxHorizontal(3) 72.024,622.900,259.340,633.940 'Placing the text here along with an image \n'>
<LTTextBoxHorizontal(4) 274.370,622.900,276.865,633.940 ' \n'>
<LTTextBoxHorizontal(5) 72.024,600.340,199.945,611.380 'Also wanted some text here \n'>
<LTTextBoxHorizontal(6) 72.024,576.220,316.116,587.260 'And also wanted to repeat the text and an image here \n'>
<LTTextBoxHorizontal(7) 331.270,576.220,333.765,587.260 ' \n'>
<LTTextBoxHorizontal(8) 72.024,679.780,74.519,690.820 ' \n'>
<LTTextBoxVertical(9) 72.024,646.900,77.400,680.380 ' T\n'>
<LTTextBoxVertical(10) 72.024,531.190,74.519,564.700 ' \n'>
<LTTextBoxHorizontal(11) 150.020,445.510,152.515,456.550 ' \n'>
<LTTextBoxHorizontal(12) 72.024,426.070,104.275,437.110 'Footer \n'>
<LTFigure(Image9) 462.000,683.000,539.500,756.000 matrix=[77.50,0.00,0.00,73.00, (462.00,683.00)]>
<LTFigure(Image10) 259.290,625.420,274.232,637.520 matrix=[14.94,0.00,0.00,12.10, (259.29,625.42)]>
<LTFigure(Image11) 316.230,578.810,331.172,590.910 matrix=[14.94,0.00,0.00,12.10, (316.23,578.81)]>
<LTFigure(Image12) 72.000,448.810,149.500,521.810 matrix=[77.50,0.00,0.00,73.00, (72.00,448.81)]>
The same code from my previous comment was used to extract this document as well. Let me know if any more information is required.
Hi @pietermarsman, is there any update on this issue?
Thanks in advance.
I'm facing the same issue, though my PDF contains no images. I'm getting the LTTextBox objects out of order.
I'm not sure what's going on. I'm also not sure if this is a bug or if the order of objects is an implementation detail that is not guaranteed. I need to look into this, but that can take a while. Feel free to propose a solution if you have one.
Hello everyone, I am facing a similar issue where the LTTextBoxHorizontal objects are in the wrong order. Does anyone have a solution? Thanks!
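One possible workaround, offered as a sketch rather than a fix: as far as I can tell, the page layout lists the grouped text boxes first and the non-text objects (figures, images) afterwards, which would explain the dumps above. If that is the case, the order can be rebuilt from the bounding boxes after get_result. The helper below is hypothetical (reading_order, min_overlap and the Box stand-in are my own names, not part of pdfminer.six). It only assumes that each object exposes .bbox as (x0, y0, x1, y1) with the origin at the bottom-left, which matches the coordinates printed in this thread. Objects whose vertical ranges overlap enough are treated as one row, rows run top to bottom, and each row is sorted left to right.

```python
from collections import namedtuple

# Stand-in for pdfminer layout objects so the sketch runs on its own.
# Real LT* objects are assumed to expose .bbox the same way.
Box = namedtuple("Box", ["name", "bbox"])


def reading_order(objs, min_overlap=0.5):
    """Return objs sorted top-to-bottom, then left-to-right within a row.

    Two objects share a row when their vertical overlap is at least
    min_overlap times the height of the shorter one (a heuristic).
    """
    rows = []  # list of (anchor_bbox, members), topmost row first
    for obj in sorted(objs, key=lambda o: -o.bbox[3]):
        _, y0, _, y1 = obj.bbox
        for anchor, members in rows:
            ay0, ay1 = anchor[1], anchor[3]
            overlap = min(y1, ay1) - max(y0, ay0)
            shorter = min(y1 - y0, ay1 - ay0)
            if overlap > 0 and overlap >= min_overlap * shorter:
                members.append(obj)
                break
        else:
            # No existing row overlaps enough: this object starts a new row.
            rows.append((obj.bbox, [obj]))
    ordered = []
    for _, members in rows:
        ordered.extend(sorted(members, key=lambda o: o.bbox[0]))
    return ordered


# Coordinates copied from the first dump in this thread,
# listed in the order the objects were actually returned.
page = [
    Box("text 2", (20.400, 204.216, 153.087, 214.266)),
    Box("text 3", (173.320, 204.246, 180.507, 214.266)),
    Box("text 4", (20.400, 189.096, 252.357, 199.146)),
    Box("Image42", (153.050, 206.350, 173.292, 220.500)),
]
names = [b.name for b in reading_order(page)]
print(names)
```

With these four objects the inline image ends up between the two text boxes on its line. This is purely geometric, so multi-column pages or text boxes that span several lines would probably need a different grouping rule, and the 0.5 threshold is a guess.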