extract_text mixes lines
Hi,
I am not able to find any combination of LAParams that correctly converts the attached simple PDF to text. In the resulting text, the lines are not in the correct order:
Expected result
============================================================
date1 TEXT11 TEXT 12 TEXT 13 num1 Num2
TEXT21 TEXT 22
date2 text31
=============================================================
Actual result:
======================================================
date1 TEXT11 TEXT 12 TEXT 13 num1 Num2
date2 text31
TEXT21 TEXT 22
======================================================
So the lines "date2 text31" and "TEXT21 TEXT 22" are swapped with each other.
Hi @Ev2geny,
I've had a quick look, and I think boxes_flow=None and line_margin=0.1 is right for you.
To explain, you need a slightly smaller line margin than default to ensure the TEXT21 TEXT 22 and the text31 are split into different lines (since they are quite close). boxes_flow=None disables the advanced layout analysis, which isn't useful for you.
The following code:
from pdfminer.high_level import extract_text
from pdfminer.layout import LAParams
print(
extract_text("excel_sim.pdf", laparams=LAParams(line_margin=0.1, boxes_flow=None))
)
produces the following output:
date1
TEXT11 TEXT 12 TEXT 13
num1
Num2
TEXT21 TEXT 22
text31
date2
The only differences from your expected output are some additional line breaks, and that date2 and text31 are swapped.
When boxes_flow=None, elements are ordered by their location: top to bottom, then left to right. The measurements are taken from the bottom of the element and from its left-hand edge. In your case, the bottom of date2 is 689.54 and the bottom of text31 is 689.56. Since these are measured from the bottom of the PDF page, date2 is actually slightly underneath text31, so it is emitted after it.
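To make the ordering rule concrete, here is a small pure-Python sketch. It does not call pdfminer; it only mimics the rule described above (higher bottom edge first, then left edge). The two bottom values 689.54 and 689.56 are the ones quoted above; all x0 values and the y0 of the TEXT21 row are made-up assumptions for illustration.

```python
# Pure-Python illustration of "top to bottom, then left to right",
# keyed on the bottom edge (y0) and left edge (x0) of each text box.
# Only 689.54 and 689.56 come from the PDF; the other numbers are assumed.
boxes = [
    {"text": "date2", "x0": 50.0, "y0": 689.54},
    {"text": "text31", "x0": 300.0, "y0": 689.56},
    {"text": "TEXT21 TEXT 22", "x0": 150.0, "y0": 702.0},
]


def reading_order(boxes):
    # PDF y coordinates grow upwards, so a larger y0 means higher on the page.
    return sorted(boxes, key=lambda b: (-b["y0"], b["x0"]))


ordered = [b["text"] for b in reading_order(boxes)]
print(ordered)  # ['TEXT21 TEXT 22', 'text31', 'date2']
```

Because text31 sits 0.02 units higher than date2, it sorts first even though it is further to the right.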
One way to get around this is to increase the char_margin so that rather than each of your items being in an individual text box, you have each of the three lines as a single text box. If you add char_margin=100, you get the following output:
date1 TEXT11 TEXT 12 TEXT 13 num1 Num2
TEXT21 TEXT 22
date2 text31
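As a rough mental model of why a larger char_margin helps, the sketch below groups characters on one line by comparing the horizontal gap against char_margin times the character width. This is a simplified assumption about the behaviour, not pdfminer's actual implementation; the Char class, the group_chars function and all coordinates are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Char:
    text: str
    x0: float  # left edge
    x1: float  # right edge


def group_chars(chars, char_margin):
    """Group characters left to right; start a new group when the gap
    to the previous character is not below char_margin * character width."""
    groups = []
    prev = None
    for ch in sorted(chars, key=lambda c: c.x0):
        if prev is not None:
            gap = ch.x0 - prev.x1
            limit = char_margin * max(prev.x1 - prev.x0, ch.x1 - ch.x0)
            if gap < limit:
                groups[-1].append(ch)
                prev = ch
                continue
        groups.append([ch])
        prev = ch
    return ["".join(c.text for c in g) for g in groups]


# Two adjacent characters, then one far to the right (like a separate column).
chars = [Char("a", 0, 5), Char("b", 5, 10), Char("c", 60, 65)]
print(group_chars(chars, char_margin=2.0))    # ['ab', 'c']
print(group_chars(chars, char_margin=100.0))  # ['abc']
```

With a small margin the distant column stays separate; with char_margin=100 everything on the row ends up in one group, which is the effect used above.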
If it helps, the text boxes for my first output are as follows:
[Image: text boxes for the first output]
And the text boxes for my second output (with increased char_margin) are as follows:
[Image: text boxes for the second output]
I hope that makes sense and helps. Please do let me know so we can close the issue if it's resolved.
@jstockwin, thank you very much! This helps!
@jstockwin, sorry for coming back to this, but I finally managed to reduce the bank extract to pinpoint the problem, editing out all confidential information.
See the attached sberbank_reduced_to_problem_EN.pdf
Is there any way to tweak pdfminer to produce the following output from the attached file?
======================================================
26.07.2019 02:04 AAAAAAAAAAAAAAAAAAAAAAAAAAA 750,00 -750,00
BBBBBBBBBBBBB
05.08.2019 / - Ccccccccccccccccccccccc
=======================================================
With the parameters you suggested (line_margin=0.1, char_margin=100, boxes_flow=None) I still get mixed-up lines in my example:
========================================
AAAAAAAAAAAAAAAAAAAAAAAAAAA
26.07.2019 02:04 750,00 -750,00
BBBBBBBBBBBBB
05.08.2019 / -
Ccccccccccccccccccccccc
========================================
I get good results with line_margin=-1 and boxes_flow=None:
>>> print(extract_text("sberbank_reduced_to_problem_EN.pdf", laparams=LAParams(line_margin=-1, boxes_flow=None)))
26.07.2019 02:04
AAAAAAAAAAAAAAAAAAAAAAAAAAA
750,00
-750,00
BBBBBBBBBBBBB
05.08.2019 / -
Ccccccccccccccccccccccc
(Note that the output is in exactly the expected order, only with blank spaces inserted in between.)
@pietermarsman, thanks for looking at this. Even with your example I still don't get exactly the output I need: some of the elements go onto a new line (compare with the desired output).
I found that by adjusting char_margin I can put 750,00 and -750,00 on the same line. With 10 <= char_margin <= 56, I get the following:
print(extract_text(pdf_file_name, laparams=LAParams(line_margin = -1, char_margin=56.0, boxes_flow=None)))
result:
26.07.2019 02:04
AAAAAAAAAAAAAAAAAAAAAAAAAAA # <= Problem: moved to new line
750,00 -750,00 # <= Problem: moved to new line
BBBBBBBBBBBBB
05.08.2019 / -
Ccccccccccccccccccccccc # <= Problem: moved to new line
However, if I increase char_margin any further (char_margin >= 57), the lines get mixed up again, and the items that belong on the same line still never all end up on it:
print(extract_text(pdf_file_name, laparams=LAParams(line_margin = -1, char_margin=57.0, boxes_flow=None)))
result:
AAAAAAAAAAAAAAAAAAAAAAAAAAA # <= Problem. Taken out of line
26.07.2019 02:04 750,00 -750,00
BBBBBBBBBBBBB
05.08.2019 / -
Ccccccccccccccccccccccc # <= Problem: put on the new line
So, is there any way the desired output can be achieved?
For myself, I solved the problem by writing my own function, which extracts the text the way I need. It would be nice to have similar functionality directly in pdfminer, though, as I do not think I need anything special.
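For anyone landing here, a minimal sketch of one way such a function could work (this is not the actual code mentioned above, just an illustration): group items whose bottom edges lie within a tolerance into one row, then sort each row left to right. The function name, the y_tolerance parameter and the dictionary layout are hypothetical, and the coordinates other than 689.54 and 689.56 are made up.

```python
def group_into_rows(items, y_tolerance=1.0):
    """Group items into rows by bottom edge (y0), then order each row by x0.

    An item joins the current row if its y0 is within y_tolerance of the
    first item of that row; otherwise it starts a new row.
    """
    rows = []
    # Larger y0 is higher on the page, so process from the top down.
    for item in sorted(items, key=lambda it: -it["y0"]):
        if rows and abs(rows[-1][0]["y0"] - item["y0"]) <= y_tolerance:
            rows[-1].append(item)
        else:
            rows.append([item])
    return [
        " ".join(it["text"] for it in sorted(row, key=lambda it: it["x0"]))
        for row in rows
    ]


items = [
    {"text": "date2", "x0": 50.0, "y0": 689.54},
    {"text": "text31", "x0": 300.0, "y0": 689.56},
    {"text": "TEXT21 TEXT 22", "x0": 150.0, "y0": 702.0},
]
print(group_into_rows(items))  # ['TEXT21 TEXT 22', 'date2 text31']
```

The tolerance absorbs tiny baseline differences such as 689.54 versus 689.56, so items on the same visual row stay together regardless of which one is fractionally higher.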
I cannot find any setting that gets the desired output. But I'm not sure why a low line-margin and a high char-margin don't give the expected result. In my head, this should merge everything horizontally and keep everything vertically separate. But it doesn't. So this needs some thorough analysis.
OK, thanks for the answer. I have resolved my immediate need with my own code. I will also try to look into what is going wrong. Could you perhaps mark this issue as a bug in the meantime?