pdfminer.six pdfminer fails to extract text and co-ordinates from fields in a non-editable (i.e. flattened) PDF form

Open AIMLAPP opened this issue 3 years ago • 2 comments

Hi,

I am trying to use pdfminer to extract all the words, along with the co-ordinates of each word, from filled-in PDF forms that are no longer editable (i.e. they are flattened and are NOT AcroForms). I can only extract text and co-ordinates that lie outside the fields. For example, in the attached image, "... CAPITAL LETTERS or tick ✓ as necessary." can be extracted, but "Disneyland", "Mickey", etc. cannot.

As a result, with the code I am using, the words and co-ordinates extracted from a blank form, a filled-in AcroForm, and a non-editable PDF form are exactly the same.

Is there any way to resolve this using pdfminer or any alternative packages (in the case that it cannot be resolved by pdfminer)?

The sample PDF can be found here: https://drive.google.com/file/d/1HroGrPqADRQ0_ccsIP6wHmqof0ghTdVZ/view

Here is the code:

from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTAnno, LTChar, LTTextBox
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

manager = PDFResourceManager()
laparams = LAParams()
dev = PDFPageAggregator(manager, laparams=laparams)
interpreter = PDFPageInterpreter(manager, dev)

x_list, y_list, x1_list, y1_list, text_list = [], [], [], [], []


def save_word(x, y, x1, y1, text):
    # Record one finished word: (x, y) is its top-left corner, (x1, y1) its bottom-right.
    print('At %r is text: %s' % ((x, y, x1, y1), text))
    x_list.append(x)
    y_list.append(y)
    x1_list.append(x1)
    y1_list.append(y1)
    text_list.append(text)


# Keep the file open while pages are processed; PDFPage.get_pages reads lazily.
with open('sample.pdf', 'rb') as fp:
    for page in PDFPage.get_pages(fp):
        print('--- Processing Page ---')
        interpreter.process_page(page)
        layout = dev.get_result()
        x, y, x1, y1, text = -1, -1, -1, -1, ''
        for textbox in layout:
            # Only top-level text boxes are visited; other layout objects are skipped.
            if not isinstance(textbox, LTTextBox):
                continue
            for line in textbox:
                for char in line:
                    if isinstance(char, LTAnno) or char.get_text() == ' ':
                        # A space or a layout-inserted separator ends the current word.
                        if x != -1:
                            save_word(x, y, x1, y1, text)
                        x, y, x1, y1, text = -1, -1, -1, -1, ''
                    elif isinstance(char, LTChar):
                        text += char.get_text()
                        if x == -1:
                            x, y, x1, y1 = char.bbox[0], char.bbox[3], char.bbox[2], char.bbox[1]
                        else:
                            # Grow the box so it covers the whole word, not just its first character.
                            x1 = char.bbox[2]
                            y, y1 = max(y, char.bbox[3]), min(y1, char.bbox[1])
        # Flush a word that is still open when the page ends.
        if x != -1:
            save_word(x, y, x1, y1, text)

Apr 29 '21 16:04 AIMLAPP

Hi, this can be solved with the Konfuzio SDK. Here is a snapshot of the result from our web interface. You can sign up for free.

May 14 '21 14:05 jabzer-research

@AIMLAPP did you find any solution to this flattened PDF problem?

Aug 31 '21 08:08 duskybomb
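
A possible explanation, offered as an assumption rather than something confirmed for this particular file: when a form is flattened, the field appearances are often kept as Form XObjects, which pdfminer.six reports as LTFigure containers. The code in the question only walks top-level text boxes, so characters nested inside figures would never be visited. The sketch below walks the whole layout tree instead and groups characters into words by position. The helper names (iter_leaves, group_words, extract_words) and the max_gap threshold are made up for illustration; extract_pages is the pdfminer.six high-level entry point.

```python
from typing import Iterable, Iterator, List, Tuple

# (x0, top, x1, bottom, text), the same order the script in the question prints.
Word = Tuple[float, float, float, float, str]


def iter_leaves(obj) -> Iterator:
    """Yield every text leaf in a layout tree, descending into all containers.

    Duck-typed on purpose: containers (LTPage, LTFigure, LTTextBox, LTTextLine)
    are iterable, while LTChar and LTAnno are not and both expose get_text().
    """
    if hasattr(obj, "__iter__"):
        for child in obj:
            yield from iter_leaves(child)
    elif hasattr(obj, "get_text"):
        yield obj


def group_words(leaves: Iterable, max_gap: float = 1.0) -> List[Word]:
    """Join adjacent characters into words.

    A word ends at whitespace, at a leaf without a bbox (LTAnno), or when the
    next character is not on the same line within max_gap points. max_gap is
    a heuristic assumption and may need tuning per document.
    """
    words: List[Word] = []
    current = None  # [x0, top, x1, bottom, text]
    for leaf in leaves:
        text = leaf.get_text()
        bbox = getattr(leaf, "bbox", None)
        if bbox is None or text.isspace():
            if current is not None:
                words.append(tuple(current))
                current = None
            continue
        x0, y0, x1, y1 = bbox
        joins = (
            current is not None
            and y0 < current[1] and y1 > current[3]     # overlaps vertically
            and abs(x0 - current[2]) <= max_gap         # touches horizontally
        )
        if joins:
            current[1] = max(current[1], y1)
            current[2] = x1
            current[3] = min(current[3], y0)
            current[4] += text
        else:
            if current is not None:
                words.append(tuple(current))
            current = [x0, y1, x1, y0, text]
    if current is not None:
        words.append(tuple(current))
    return words


def extract_words(path: str) -> List[Word]:
    """Collect words from every page, including text nested inside figures."""
    # Imported here so the helpers above can be exercised without pdfminer.six.
    from pdfminer.high_level import extract_pages

    words: List[Word] = []
    for page in extract_pages(path):
        words.extend(group_words(iter_leaves(page)))
    return words
```

Calling extract_words('sample.pdf') should then return tuples shaped like the ones printed by the script in the question. Passing LAParams(all_texts=True) to the original code is another option worth trying, since it asks pdfminer.six to run layout analysis on text inside figures. If the field values still do not appear, they may have been drawn as vector outlines or images, in which case there is no text layer to read and OCR would be needed.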
