pdfminer.six extract_text mixes lines

Hi,

I am not able to find any combination of LAParams that correctly converts the attached simple PDF to text. In the resulting text, the lines are not in the correct order:

Expected result:

============================================================
date1 TEXT11 TEXT 12 TEXT 13 num1 Num2
TEXT21 TEXT 22
date2 text31

=============================================================

Actual result:

======================================================
date1 TEXT11 TEXT 12 TEXT 13 num1 Num2

date2 text31

TEXT21 TEXT 22

======================================================

So the lines "date2 text31" and "TEXT21 TEXT 22" are swapped with each other.

excel_sim.pdf excel_sim.xlsx

Aug 01 '20 21:08 Ev2geny

Hi @Ev2geny,

I've had a quick look, and I think boxes_flow=None and line_margin=0.1 is right for you.

To explain, you need a slightly smaller line margin than default to ensure the TEXT21 TEXT 22 and the text31 are split into different lines (since they are quite close). boxes_flow=None disables the advanced layout analysis, which isn't useful for you.

The following code:

from pdfminer.high_level import extract_text
from pdfminer.layout import LAParams

print(
    extract_text("excel_sim.pdf", laparams=LAParams(line_margin=0.1, boxes_flow=None))
)

produces the following output:

date1

TEXT11 TEXT 12 TEXT 13

num1

Num2

TEXT21 TEXT 22
text31

date2

The only differences from your expected output are some additional newlines, and that date2 and text31 are swapped.

When boxes_flow=None, elements are ordered by their location: left to right, top to bottom. The measurements are taken from the bottom of the element and from the left-hand edge. In your case, the bottom of date2 is at 689.54 and the bottom of text31 is at 689.56. Since these are measured from the bottom of the PDF page, date2 is actually slightly underneath text31.
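To illustrate the ordering described above, here is a minimal sketch in plain Python. It is a simplified model of "top to bottom, then left to right", not pdfminer.six's actual code. The `Box` type and the `reading_order` helper are hypothetical names, and the `x0` values are invented; only the two `y0` values come from this thread.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Hypothetical stand-in for a text box: its text and bottom-left corner."""
    text: str
    x0: float  # left edge, measured from the left of the page
    y0: float  # bottom edge, measured from the bottom of the page


def reading_order(boxes):
    """Sort top to bottom (larger y0 first), then left to right."""
    return sorted(boxes, key=lambda b: (-b.y0, b.x0))


# The y0 values are the ones quoted above; the x0 values are made up.
boxes = [
    Box("date2", x0=50.0, y0=689.54),
    Box("text31", x0=300.0, y0=689.56),
]

ordered = [b.text for b in reading_order(boxes)]
print(ordered)  # ['text31', 'date2']
```

Even though date2 is further left, the 0.02 difference in the bottom edge is enough to place text31 first, because the vertical position is compared before the horizontal one.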

One way to get around this is to increase the char_margin so that rather than each of your items being in an individual text box, you have each of the three lines as a single text box. If you add char_margin=100, you get the following output:

date1 TEXT11 TEXT 12 TEXT 13 num1 Num2

TEXT21 TEXT 22

date2 text31

If it helps, the text boxes for my first output are as follows:

And the text boxes for my second output (with increased char_margin) are as follows:

I hope that makes sense and helps. Please do let me know so we can close the issue if it's resolved.

Aug 03 '20 08:08 jstockwin

@jstockwin , thank you very much! This helps!

Aug 03 '20 16:08 Ev2geny

@jstockwin, sorry for coming back to this, but I finally managed to reduce a bank statement to pinpoint the problem, by editing out all confidential information.

See attached sberbank_reduced_to_problem_EN.pdf

Is there any way to tweak pdfminer to produce the following output from the attached file?

======================================================

26.07.2019 02:04    AAAAAAAAAAAAAAAAAAAAAAAAAAA    750,00    -750,00 
BBBBBBBBBBBBB 
05.08.2019 / -    Ccccccccccccccccccccccc

=======================================================

With the parameters you suggested (line_margin=0.1, char_margin=100, boxes_flow=None), I still get mixed-up lines in my example:

========================================

  AAAAAAAAAAAAAAAAAAAAAAAAAAA 

26.07.2019 02:04 750,00 -750,00

                 

BBBBBBBBBBBBB

05.08.2019 / -

Ccccccccccccccccccccccc

========================================

Aug 28 '20 22:08 Ev2geny

I get good results with line_margin=-1 and boxes_flow=None:

>>> print(extract_text("sberbank_reduced_to_problem_EN.pdf", laparams=LAParams(line_margin=-1, boxes_flow=None)))
26.07.2019 02:04

AAAAAAAAAAAAAAAAAAAAAAAAAAA

                 

750,00

-750,00

BBBBBBBBBBBBB

05.08.2019 / -

Ccccccccccccccccccccccc

(Note that the output is in exactly the expected order, only with blank spaces inserted in between.)

Sep 13 '20 10:09 pietermarsman

@pietermarsman, thanks for looking at this. Even in your example I still didn't get exactly the output I needed: some of the elements go onto a new line (compare with the desired output).

I found that by adjusting char_margin I can put 750,00 and -750,00 on the same line. With 10 <= char_margin <= 56, I get the following:

print(extract_text(pdf_file_name, laparams=LAParams(line_margin = -1, char_margin=56.0, boxes_flow=None)))

result:

26.07.2019 02:04

AAAAAAAAAAAAAAAAAAAAAAAAAAA   # <= Problem: moved to new line

                 

750,00 -750,00            # <= Problem: moved to new line

BBBBBBBBBBBBB  

05.08.2019 / -

Ccccccccccccccccccccccc  # <= Problem: moved to new line

However, if I increase char_margin any further (char_margin >= 57), the lines get mixed up again, and it still never puts everything that belongs on one line onto the same line:

print(extract_text(pdf_file_name, laparams=LAParams(line_margin = -1, char_margin=57.0, boxes_flow=None)))

result:

  AAAAAAAAAAAAAAAAAAAAAAAAAAA # <= Problem. Taken out of line

26.07.2019 02:04 750,00 -750,00

                 

BBBBBBBBBBBBB

05.08.2019 / -

Ccccccccccccccccccccccc # <= Problem: put on the new line

So, is there any way the desired output can be achieved?

For myself, I solved the problem by writing my own function, which extracts text the way I need, but it would be nice to have similar functionality directly in pdfminer, as I do not think I need anything special.
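The custom function itself is not shown in this thread. As a rough sketch of one way such a workaround could look (not necessarily what was used here), text fragments can be grouped into rows by their bottom edge with a tolerance and then sorted left to right within each row. The names `group_into_rows`, `rows_from_pdf` and `y_tolerance` are hypothetical, and the coordinates in the example are invented apart from the two `y0` values quoted earlier in the thread.

```python
def group_into_rows(fragments, y_tolerance=2.0):
    """Group (text, x0, y0) fragments into rows of text.

    A fragment joins the current row when its bottom edge is within
    y_tolerance of the first fragment of that row. Rows are returned
    top to bottom, with fragments joined left to right.
    """
    rows = []
    for text, x0, y0 in sorted(fragments, key=lambda f: -f[2]):
        if rows and abs(rows[-1]["y0"] - y0) <= y_tolerance:
            rows[-1]["items"].append((x0, text))
        else:
            rows.append({"y0": y0, "items": [(x0, text)]})
    return [" ".join(t for _, t in sorted(row["items"])) for row in rows]


def rows_from_pdf(path, y_tolerance=2.0):
    """Apply group_into_rows to each page of a PDF using pdfminer.six."""
    # Imported here so the grouping logic above runs without pdfminer installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LAParams, LTTextBox, LTTextLine

    # A small char_margin (an assumption) keeps distant fragments separate.
    laparams = LAParams(char_margin=1.0, boxes_flow=None)
    rows = []
    for page in extract_pages(path, laparams=laparams):
        fragments = [
            (line.get_text().strip(), line.x0, line.y0)
            for box in page if isinstance(box, LTTextBox)
            for line in box if isinstance(line, LTTextLine)
            if line.get_text().strip()
        ]
        rows.extend(group_into_rows(fragments, y_tolerance))
    return rows


# Invented coordinates, except the y0 of date2 and text31.
example = [
    ("date1", 50.0, 720.0),
    ("TEXT11 TEXT 12 TEXT 13", 120.0, 720.0),
    ("num1", 300.0, 720.0),
    ("Num2", 350.0, 720.0),
    ("TEXT21 TEXT 22", 120.0, 705.0),
    ("date2", 50.0, 689.54),
    ("text31", 120.0, 689.56),
]
rows = group_into_rows(example)
print("\n".join(rows))
```

The tolerance has to be tuned per document: too small and fragments of one visual row are split, too large and neighbouring rows are merged.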

Sep 14 '20 13:09 Ev2geny

I cannot find any setting that gets the desired output. But I'm not sure why a low line-margin and a high char-margin don't give the expected result. In my head, this should merge everything horizontally and keep everything vertically separate. But it doesn't. So this needs some thorough analysis.

Sep 17 '20 19:09 pietermarsman

OK, thanks for the answer. I have resolved my immediate need with my own code. I will also try to look at what is going wrong. Could you perhaps mark this issue as a bug in the meantime?

Sep 18 '20 20:09 Ev2geny