Sometimes the returned text is not ordered left to right, top to bottom

Open gbrault opened this issue 6 years ago • 3 comments

When parsing text from a PDF, I would like it returned left to right, top to bottom, but that is not always the case. I have written a function to do that and will open a pull request!

gbrault avatar Feb 10 '20 07:02 gbrault

Did you push the pull request? I can't find it. I have the same issue.

softskills244 avatar Feb 18 '20 11:02 softskills244

Yes, I did. I will push it again.

gbrault avatar Feb 21 '20 06:02 gbrault

Or, better, here is the code of the function:

```python
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBoxHorizontal
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def lrtd_parse_page(document, callback, context):
    """Parse each page and pass its text tokens to callback, left to right and top down (lrtd).

    :param document: open stream to the PDF file to be worked on
    :param callback: a function called each time a page is processed; it accepts 3 parameters:
         p, the page index, starting at 0
         selts, the lrtd-sorted token list
         context
         selts is a list of dicts {"x1": x1, "y1": y1, "x0": x0, "y0": y0, "txt": text}
         (x0, y0), (x1, y1) are the coordinates of the box surrounding the token
         if callback returns False (or any falsy value), lrtd_parse_page quits
    :param context: a caller context object for storing some data, passed to callback
    """
    rsrcmgr = PDFResourceManager()
    laparams = LAParams()
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for p, page in enumerate(PDFPage.get_pages(document)):
        interpreter.process_page(page)
        layout = device.get_result()
        elts = []
        m = 0  # highest y1 seen on the page
        mindeltaheight = 1000
        # Break each text box into single lines (without \n) and interpolate their coordinates.
        for element in layout:
            if isinstance(element, LTTextBoxHorizontal):
                x0, y0, x1, y1 = element.x0, element.y0, element.x1, element.y1
                m = max(m, y1)
                lines = element.get_text().splitlines()
                if not lines:
                    continue  # avoid dividing by zero on an empty box
                deltaheight = (y1 - y0) / len(lines)
                mindeltaheight = min(mindeltaheight, deltaheight)
                for j, line in enumerate(lines):
                    elts.append({
                        "x1": x1, "y1": y1 - j * deltaheight,
                        "x0": x0, "y0": y1 - (1 + j) * deltaheight,
                        "txt": line,
                    })
        n = len(elts)
        # Align tokens on the same "line" when they are not too far apart
        # (less than half the minimum line height of the page).
        # The indices cover every pair; the original 1-based loop skipped the last element.
        for i in range(n):
            for j in range(i + 1, n):
                if abs(elts[i]["y0"] - elts[j]["y0"]) < mindeltaheight / 2:
                    elts[j]["y0"] = elts[i]["y0"]
                if abs(elts[i]["y1"] - elts[j]["y1"]) < mindeltaheight / 2:
                    elts[j]["y1"] = elts[i]["y1"]

        # Sort by distance from the top of the page, then by left edge.
        selts = sorted(elts, key=lambda item: (round((2 * m - item["y0"] - item["y1"]) / 2, 0),
                                               round(item["x0"], 0)))

        # for elt in selts:
        #    print(f"({p})({elt['x0']:0.0f},{elt['y0']:0.0f},{elt['x1']:0.0f},{elt['y1']:0.0f}){elt['txt']}")
        if not callback(p, selts, context):
            return
```
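
For anyone wondering what the callback might look like, here is a minimal sketch based on the contract in the docstring. The names `collect_text`, `max_pages`, and `sample.pdf` are made up for illustration and are not part of pdfminer or of the function above.

```python
def collect_text(p, selts, context):
    """Hypothetical callback: keep each page's tokens as one string."""
    # selts arrives already sorted left to right, top down.
    context["pages"].append(" ".join(elt["txt"] for elt in selts))
    # A falsy return value tells lrtd_parse_page to stop, so this
    # stops once max_pages pages have been collected.
    return p + 1 < context["max_pages"]


context = {"pages": [], "max_pages": 2}

# With a real file (the path is an assumption), the call would be:
# with open("sample.pdf", "rb") as fh:
#     lrtd_parse_page(fh, collect_text, context)
```

The caller owns `context`, so after the call the collected strings are available in `context["pages"]`.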

I put this function in high_level.py. It does the following things:

  • For text blocks with multiple lines, it breaks them into single lines, interpolating each line's coordinates.
  • It clusters the (x0, y0, x1, y1, text) tuples so that they belong to 'lines'.
  • It sorts them left to right, top to bottom.

Not 100% bulletproof, but it gave good results on many sample PDF files.
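
The three steps can also be tried without a PDF at hand. The sketch below works on plain dicts and is a simplified variant, not the function itself: `split_box`, `order_lrtd`, and the `tolerance` parameter are hypothetical names, and rows are formed by snapping each box to the first row seen within the tolerance rather than by the pairwise adjustment used in the function.

```python
def split_box(x0, y0, x1, y1, text):
    """Split a multi-line text box into per-line boxes of equal height."""
    lines = text.splitlines()
    if not lines:
        return []
    h = (y1 - y0) / len(lines)
    # PDF y coordinates grow upward, so the first line sits at the top (y1).
    return [
        {"x0": x0, "y0": y1 - (j + 1) * h, "x1": x1, "y1": y1 - j * h, "txt": line}
        for j, line in enumerate(lines)
    ]


def order_lrtd(elts, tolerance):
    """Group boxes into rows by bottom edge, then sort top down, left to right."""
    rows = []   # representative y0 of each row found so far
    keyed = []
    for e in elts:
        row_y = next((y for y in rows if abs(y - e["y0"]) < tolerance), None)
        if row_y is None:
            row_y = e["y0"]
            rows.append(row_y)
        # Negate y so that higher rows sort first.
        keyed.append((-row_y, e["x0"], e["txt"]))
    return [txt for _, _, txt in sorted(keyed)]


# Two side-by-side columns whose boxes are offset by half a point.
boxes = split_box(50, 700, 200, 724, "Name\nAlice") + split_box(300, 700.5, 400, 724.5, "Age\n30")
print(order_lrtd(boxes, tolerance=6))  # ['Name', 'Age', 'Alice', '30']
```

Without the row grouping, the half-point offset would be enough to put both lines of the right-hand column ahead of their left-hand neighbours, which is the kind of ordering problem this issue is about.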

gbrault avatar Feb 21 '20 06:02 gbrault