LTObjects list elements are not in occurrence order from a PDF Page when there are images and figures in the page

Open praveenkumar-ravising opened this issue 4 years ago • 8 comments

I have a document with images of various sizes placed inline between the lines of text. I need to extract the text together with the images, so that each image stays linked to the text it appears with.

Below is the order of LT objects I get when using PDFPageAggregator and get_result():

<LTTextBoxHorizontal(0) 45.800,226.466,246.761,240.446 'Logging Off – Temporary \n'>
<LTTextBoxHorizontal(1) 275.880,241.446,279.407,251.466 ' \n'>
<LTTextBoxHorizontal(2) 20.400,204.216,153.087,214.266 '1.  In the menu bar, click \n'>
<LTTextBoxHorizontal(3) 173.320,204.246,180.507,214.266 '. \n'>
<LTTextBoxHorizontal(4) 20.400,189.096,252.357,199.146 '2.  A confirmation message for Temporary or \n'>
<LTTextBoxHorizontal(5) 38.400,177.006,144.987,187.026 'Full Log off displays. \n'>
<LTTextBoxHorizontal(6) 20.400,161.776,133.767,171.826 '3.  Click Temporary. \n'>
<LTTextBoxHorizontal(7) 20.400,146.596,248.267,156.646 '4.  Log on to the same terminal to continue. \n'>
<LTTextBoxHorizontal(8) 74.180,117.606,218.321,131.586 'Logging Off – Full \n'>
<LTTextBoxHorizontal(9) 20.400,95.296,153.087,105.346 '1.  In the menu bar, click \n'>
<LTTextBoxHorizontal(10) 173.320,95.326,180.507,105.346 '. \n'>
<LTTextBoxHorizontal(11) 20.400,80.156,252.357,90.206 '2.  A confirmation message for Temporary or \n'>
<LTTextBoxHorizontal(12) 38.400,68.066,144.987,78.086 'Full Log off displays. \n'>
<LTTextBoxHorizontal(13) 20.400,52.856,93.187,62.906 '3.  Click Full. \n'>
<LTTextBoxHorizontal(14) 20.400,37.676,159.927,47.726 '4.  Log on to any terminal. \n'>
<LTTextBoxHorizontal(17) 269.220,17.027,276.601,24.047 '2 \n'>
<LTFigure(Image36) 218.850,243.510,275.473,273.860 matrix=[56.62,0.00,0.00,30.35, (218.85,243.51)]>
<LTImage(Image36) 218.850,243.510,275.473,273.860 (125, 67)>
<LTFigure(Image42) 153.050,206.350,173.292,220.500 matrix=[20.24,0.00,0.00,14.15, (153.05,206.35)]>
<LTImage(Image42) 153.050,206.350,173.292,220.500 (56, 39)>
<LTFigure(Image42) 153.050,97.468,173.292,111.618 matrix=[20.24,0.00,0.00,14.15, (153.05,97.47)]>
<LTImage(Image42) 153.050,97.468,173.292,111.618 (56, 39)>

I expected the image object below to appear adjacent to the 4th object, or as the 5th object, following the order of occurrence on the page.

<LTImage(Image42) 153.050,206.350,173.292,220.500 (56, 39)>
<LTTextBoxHorizontal(0) 45.800,226.466,246.761,240.446 'Logging Off – Temporary \n'>
<LTTextBoxHorizontal(1) 275.880,241.446,279.407,251.466 ' \n'>
<LTTextBoxHorizontal(2) 20.400,204.216,153.087,214.266 '1.  In the menu bar, click \n'>
**<LTImage(Image42) 153.050,206.350,173.292,220.500 (56, 39)>**
<LTTextBoxHorizontal(3) 173.320,204.246,180.507,214.266 '. \n'>

I presume this is a bug. Is there an existing way to stop the text box objects from being grouped ahead of the images, or to get the layout objects in their exact order of occurrence?

praveenkumar-ravising • Apr 16 '20 07:04

Hi @praveenkumar-ravising, could you share the code you are using to generate this order? Depending on the LAParams you are using, this could be the expected result.

pietermarsman • May 09 '20 13:05

Hi @pietermarsman, thanks for the response.

Please find below the code that produces the output above:

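from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

# layoutParse and jsonFormatData are my own helpers (not shown here).
def ParsePages(doc, images_folder):
    rsrcmgr = PDFResourceManager()
    laparams = LAParams(all_texts=False, detect_vertical=True)
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for (i, page) in enumerate(PDFPage.create_pages(doc)):
        text_content = []

        interpreter.process_page(page)
        # Layout objects for this page, in the order pdfminer returns them.
        layout = device.get_result()
        layoutParse(layout, i + 1, images_folder, text_content)
        extractedText = ''.join(list(filter(None, text_content)))
        jsonFormatData(extractedText, i + 1)
    return text_content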

Actual Result:

<LTTextBoxHorizontal(0) 45.800,226.466,246.761,240.446 'Logging Off – Temporary \n'>
<LTTextBoxHorizontal(1) 275.880,241.446,279.407,251.466 ' \n'>
<LTTextBoxHorizontal(2) 20.400,204.216,153.087,214.266 '1.  In the menu bar, click \n'>
<LTTextBoxHorizontal(3) 173.320,204.246,180.507,214.266 '. \n'>
<LTTextBoxHorizontal(4) 20.400,189.096,252.357,199.146 '2.  A confirmation message for Temporary or \n'>
<LTTextBoxHorizontal(5) 38.400,177.006,144.987,187.026 'Full Log off displays. \n'>
<LTTextBoxHorizontal(6) 20.400,161.776,133.767,171.826 '3.  Click Temporary. \n'>
<LTTextBoxHorizontal(7) 20.400,146.596,248.267,156.646 '4.  Log on to the same terminal to continue. \n'>
<LTTextBoxHorizontal(8) 74.180,117.606,218.321,131.586 'Logging Off – Full \n'>
<LTTextBoxHorizontal(9) 20.400,95.296,153.087,105.346 '1.  In the menu bar, click \n'>
<LTTextBoxHorizontal(10) 173.320,95.326,180.507,105.346 '. \n'>
<LTTextBoxHorizontal(11) 20.400,80.156,252.357,90.206 '2.  A confirmation message for Temporary or \n'>
<LTTextBoxHorizontal(12) 38.400,68.066,144.987,78.086 'Full Log off displays. \n'>
<LTTextBoxHorizontal(13) 20.400,52.856,93.187,62.906 '3.  Click Full. \n'>
<LTTextBoxHorizontal(14) 20.400,37.676,159.927,47.726 '4.  Log on to any terminal. \n'>
<LTTextBoxHorizontal(17) 269.220,17.027,276.601,24.047 '2 \n'>
<LTFigure(Image36) 218.850,243.510,275.473,273.860 matrix=[56.62,0.00,0.00,30.35, (218.85,243.51)]>
<LTImage(Image36) 218.850,243.510,275.473,273.860 (125, 67)>
<LTFigure(Image42) 153.050,206.350,173.292,220.500 matrix=[20.24,0.00,0.00,14.15, (153.05,206.35)]>
<LTImage(Image42) 153.050,206.350,173.292,220.500 (56, 39)>
<LTFigure(Image42) 153.050,97.468,173.292,111.618 matrix=[20.24,0.00,0.00,14.15, (153.05,97.47)]>
<LTImage(Image42) 153.050,97.468,173.292,111.618 (56, 39)>

Expectation from the document:

_<LTFigure(Image36) 218.850,243.510,275.473,273.860 matrix=[56.62,0.00,0.00,30.35, (218.85,243.51)]>
<LTImage(Image36) 218.850,243.510,275.473,273.860 (125, 67)>_
<LTTextBoxHorizontal(0) 45.800,226.466,246.761,240.446 'Logging Off – Temporary \n'>
<LTTextBoxHorizontal(1) 275.880,241.446,279.407,251.466 ' \n'>
<LTTextBoxHorizontal(2) 20.400,204.216,153.087,214.266 '1.  In the menu bar, click \n'>
<LTTextBoxHorizontal(3) 173.320,204.246,180.507,214.266 '. \n'>
_<LTFigure(Image42) 153.050,206.350,173.292,220.500 matrix=[20.24,0.00,0.00,14.15, (153.05,206.35)]>
<LTImage(Image42) 153.050,206.350,173.292,220.500 (56, 39)>_
<LTTextBoxHorizontal(4) 20.400,189.096,252.357,199.146 '2.  A confirmation message for Temporary or \n'>
<LTTextBoxHorizontal(5) 38.400,177.006,144.987,187.026 'Full Log off displays. \n'>
<LTTextBoxHorizontal(6) 20.400,161.776,133.767,171.826 '3.  Click Temporary. \n'>
<LTTextBoxHorizontal(7) 20.400,146.596,248.267,156.646 '4.  Log on to the same terminal to continue. \n'>
<LTTextBoxHorizontal(8) 74.180,117.606,218.321,131.586 'Logging Off – Full \n'>
<LTTextBoxHorizontal(9) 20.400,95.296,153.087,105.346 '1.  In the menu bar, click \n'>
<LTTextBoxHorizontal(10) 173.320,95.326,180.507,105.346 '. \n'>
_<LTFigure(Image42) 153.050,97.468,173.292,111.618 matrix=[20.24,0.00,0.00,14.15, (153.05,97.47)]>
<LTImage(Image42) 153.050,97.468,173.292,111.618 (56, 39)>_
<LTTextBoxHorizontal(11) 20.400,80.156,252.357,90.206 '2.  A confirmation message for Temporary or \n'>
<LTTextBoxHorizontal(12) 38.400,68.066,144.987,78.086 'Full Log off displays. \n'>
<LTTextBoxHorizontal(13) 20.400,52.856,93.187,62.906 '3.  Click Full. \n'>
<LTTextBoxHorizontal(14) 20.400,37.676,159.927,47.726 '4.  Log on to any terminal. \n'>
<LTTextBoxHorizontal(17) 269.220,17.027,276.601,24.047 '2 \n'>

Let me know if I am missing something in LAParams or in my code that would make the output match the order in the document.

praveenkumar-ravising • May 18 '20 09:05

Hi @praveenkumar-ravising, forgot to ask, can you also share the pdf?

pietermarsman • May 21 '20 10:05

Hi @pietermarsman,

I am attaching a sample single-page PDF that I created to demonstrate the problem. Below are the layout objects extracted from it, with the text and image order scrambled:

<LTTextBoxHorizontal(0) 306.050,679.780,308.545,690.820 ' \n'>
<LTTextBoxHorizontal(1) 540.100,679.780,542.595,690.820 ' \n'>
<LTTextBoxHorizontal(2) 77.400,646.900,229.705,657.940 'est the Image and Text Extraction \n'>
<LTTextBoxHorizontal(3) 72.024,622.900,259.340,633.940 'Placing the text here along with an image \n'>
<LTTextBoxHorizontal(4) 274.370,622.900,276.865,633.940 ' \n'>
<LTTextBoxHorizontal(5) 72.024,600.340,199.945,611.380 'Also wanted some text here \n'>
<LTTextBoxHorizontal(6) 72.024,576.220,316.116,587.260 'And also wanted to repeat the text and an image here \n'>
<LTTextBoxHorizontal(7) 331.270,576.220,333.765,587.260 ' \n'>
<LTTextBoxHorizontal(8) 72.024,679.780,74.519,690.820 ' \n'>
<LTTextBoxVertical(9) 72.024,646.900,77.400,680.380 '  T\n'>
<LTTextBoxVertical(10) 72.024,531.190,74.519,564.700 '   \n'>
<LTTextBoxHorizontal(11) 150.020,445.510,152.515,456.550 ' \n'>
<LTTextBoxHorizontal(12) 72.024,426.070,104.275,437.110 'Footer \n'>
<LTFigure(Image9) 462.000,683.000,539.500,756.000 matrix=[77.50,0.00,0.00,73.00, (462.00,683.00)]>
<LTFigure(Image10) 259.290,625.420,274.232,637.520 matrix=[14.94,0.00,0.00,12.10, (259.29,625.42)]>
<LTFigure(Image11) 316.230,578.810,331.172,590.910 matrix=[14.94,0.00,0.00,12.10, (316.23,578.81)]>
<LTFigure(Image12) 72.000,448.810,149.500,521.810 matrix=[77.50,0.00,0.00,73.00, (72.00,448.81)]>

The same code from my previous comment was used to extract this document as well. Let me know if any more information is required.

TestDocument.pdf

praveenkumar-ravising • May 25 '20 04:05

Hi @pietermarsman, is there any update or direction on this issue?

Thanks in advance.

praveenkumar-ravising • Jun 04 '20 08:06

I'm facing the same issue here, though my PDF contains no images. I'm getting LTTextBox objects out of order.

aninda052 • Jun 26 '20 20:06

I'm not sure what's going on. I'm also not sure if this is a bug or if the order of objects is an implementation detail that is not guaranteed. I need to look into this, but that can take a while. Feel free to propose a solution if you have one.

pietermarsman • Jun 29 '20 18:06

Hello everyone, I am facing a similar issue where the LTTextBoxHorizontal objects are in the wrong order. Does anyone have a solution? Thanks!

rcyost • Nov 21 '22 18:11
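
A possible workaround, which nobody in this thread has confirmed: the listings above show every text box emitted before any figure, so instead of relying on the returned order, the objects could be re-sorted by their bounding boxes. The sketch below is a heuristic only. The `Box` class is a hypothetical stand-in for a layout object, and `sort_reading_order` and its `row_tol` parameter are illustrative names, not part of pdfminer.six. It assumes each object exposes `x0`, `y0`, `x1`, `y1` in PDF coordinates, where y grows upward.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Box:
    """Hypothetical stand-in for a layout object; only the bbox is used."""
    name: str
    x0: float
    y0: float
    x1: float
    y1: float


def sort_reading_order(objs: Iterable, row_tol: float = 5.0) -> List:
    """Order objects top-to-bottom, then left-to-right within a row.

    An object joins the current row when its vertical center is within
    row_tol points of the center of the row's first object.
    """
    def center(o):
        return (o.y0 + o.y1) / 2.0

    rows = []
    # Highest vertical center first, because PDF y coordinates grow upward.
    for obj in sorted(objs, key=center, reverse=True):
        if rows and abs(center(rows[-1][0]) - center(obj)) <= row_tol:
            rows[-1].append(obj)
        else:
            rows.append([obj])
    # Within each row, read left to right.
    return [o for row in rows for o in sorted(row, key=lambda o: o.x0)]


# Coordinates copied from the first listing; the figure is listed last.
page = [
    Box("text 2", 20.400, 204.216, 153.087, 214.266),
    Box("text 3", 173.320, 204.246, 180.507, 214.266),
    Box("text 4", 20.400, 189.096, 252.357, 199.146),
    Box("Image42", 153.050, 206.350, 173.292, 220.500),
]
ordered_names = [o.name for o in sort_reading_order(page)]
print(ordered_names)  # ['text 2', 'Image42', 'text 3', 'text 4']
```

Since pdfminer layout objects carry the same four attributes, the function could in principle be applied to `list(layout)` from `device.get_result()`. The tolerance is a guess that depends on font and image sizes: an image much taller than the adjacent text line, such as Image36 above, may land in a row of its own, so `row_tol` would need tuning per document.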