IndexError while using split_text

Open ramSeraph opened this issue 3 years ago • 10 comments

IndexError thrown when using split_text=True

Traceback (most recent call last):
  File "/code/seperate_page.py", line 16, in <module>
    tables = camelot.read_pdf(out_filename,
  File "/usr/local/lib/python3.9/site-packages/camelot/io.py", line 113, in read_pdf
    tables = p.parse(
  File "/usr/local/lib/python3.9/site-packages/camelot/handlers.py", line 176, in parse
    t = parser.extract_tables(
  File "/usr/local/lib/python3.9/site-packages/camelot/parsers/lattice.py", line 431, in extract_tables
    table = self._generate_table(table_idx, cols, rows, v_s=v_s, h_s=h_s)
  File "/usr/local/lib/python3.9/site-packages/camelot/parsers/lattice.py", line 372, in _generate_table
    indices = Lattice._reduce_index(
  File "/usr/local/lib/python3.9/site-packages/camelot/parsers/lattice.py", line 191, in _reduce_index
    if t.cells[r_idx][c_idx].hspan:
IndexError: list index out of range

Steps to reproduce the bug

Code

import camelot

# add your code here
tables = camelot.read_pdf('service_providers_ul.0.pdf',
                          backend='poppler',
                          pages='1',
                          flavor='lattice',
                          split_text=True)

PDF

service_providers_ul.0.pdf

Screenshots

Not Applicable

Environment

  • OS: Linux
  • Python version: 3.9.4
  • Numpy version: 1.22.1
  • OpenCV version: 4.5.5.6
  • Ghostscript version: 0.7
  • Camelot version: 0.10.1

Additional context

There is an empty textline in one of the rows that extends past the edge of the last column. This causes the split_textline code to assign a column index beyond the available column indices, and an exception is thrown further down the line when that index is used.
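
A minimal sketch of this failure mode, using plain lists instead of camelot's own table and pdfminer objects. The names `cols`, `cells`, and `assign_column` and all coordinates are hypothetical stand-ins chosen for illustration; this is not camelot code.

```python
# Hypothetical 3-column table: each column is an (x_left, x_right) pair.
cols = [(0, 100), (100, 200), (200, 300)]
cells = [["a", "b", "c"]]  # one row, three cells


def assign_column(x, cols, clamp=False):
    """Return the index of the column containing x.

    Text left of the first column goes to column 0. Text at or beyond the
    right edge of the last column is pushed to `last index + 1`, which
    mimics the unchecked increment, unless `clamp` is set.
    """
    if x < cols[0][0]:
        return 0
    for idx, (left, right) in enumerate(cols):
        if left <= x < right:
            return idx
    idx = len(cols)  # one past the last valid index
    return min(idx, len(cols) - 1) if clamp else idx


# Unchecked index: using it on the cell grid raises IndexError.
bad_idx = assign_column(305, cols)
try:
    cells[0][bad_idx]
except IndexError as exc:
    print("unclamped:", exc)

# Bounds-checked index: stays inside the grid.
good_idx = assign_column(305, cols, clamp=True)
print("clamped:", cells[0][good_idx])
```

Clamping keeps the stray text in the last column. Whether that is the right cell for it depends on the document, so this only shows why the index goes out of range, not how the text should ideally be placed.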

ramSeraph avatar Mar 07 '22 10:03 ramSeraph

Probably related to #98 #285

ramSeraph avatar Mar 07 '22 10:03 ramSeraph

Cause of Error: https://github.com/camelot-dev/camelot/blob/master/camelot/utils.py#L626

The index shouldn't be incremented unconditionally; the bounds should be checked first.

ramSeraph avatar Mar 07 '22 10:03 ramSeraph

Can confirm that changing that piece of code to check the column index bounds fixes the problem.

ramSeraph avatar Mar 07 '22 10:03 ramSeraph

In general, can we split PDFMiner textline objects on whitespace and use only the non-blank areas when checking for inclusion in a table/column/row/cell?
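
A rough sketch of what such a split could look like, assuming each character is available as a `(text, x0, x1)` tuple in reading order. The tuple layout and the function name `split_on_whitespace` are assumptions made for illustration; this does not use PDFMiner's objects and is not how camelot currently works.

```python
from itertools import groupby


def split_on_whitespace(chars):
    """Split a line into its non-blank segments.

    `chars` is a list of (text, x0, x1) tuples, one per character, in
    reading order. Returns one (text, x0, x1) tuple per run of
    non-whitespace characters, spanning only that run's horizontal extent.
    """
    segments = []
    for is_blank, run in groupby(chars, key=lambda ch: ch[0].isspace()):
        if is_blank:
            continue  # whitespace runs contribute no segment
        run = list(run)
        text = "".join(ch[0] for ch in run)
        segments.append((text, run[0][1], run[-1][2]))
    return segments


# "ab", a two-space gap, then "c"
line = [("a", 0, 5), ("b", 5, 10), (" ", 10, 15), (" ", 15, 20), ("c", 20, 25)]
segments = split_on_whitespace(line)

# A line made only of whitespace yields no segments at all.
blank_line = [(" ", 290, 300), (" ", 300, 310)]
blank_segments = split_on_whitespace(blank_line)
```

With this approach, a blank textline that overhangs the last column would produce nothing to assign, so it could not end up with an out-of-range column index.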

This might fix most of the errors I have encountered with this library.

ramSeraph avatar Mar 07 '22 10:03 ramSeraph

@ramSeraph Can you elaborate on how to implement the fix for this? I think I have the same issue with this doc:

Example 1.pdf

drennapete avatar Mar 18 '22 02:03 drennapete

Is this the same issue? https://github.com/atlanhq/camelot/pull/475

drennapete avatar Mar 18 '22 02:03 drennapete

@drennapete It might be the same issue. With the following fix applied, parsing went through:

diff --git a/camelot/utils.py b/camelot/utils.py
index 404c00b..e5f2cbc 100644
--- a/camelot/utils.py
+++ b/camelot/utils.py
@@ -623,7 +623,8 @@ def split_textline(table, textline, direction, flag_size=False, strip_text=""):
                         else:
                             # TODO: add test
                             if cut == x_cuts[-1]:
-                                cut_text.append((r, cut[0] + 1, obj))
+                                new_idx = min(cut[0] + 1, len(table.cols) - 1)
+                                cut_text.append((r, new_idx, obj))
                     elif isinstance(obj, LTAnno):
                         cut_text.append((r, cut[0], obj))
         elif direction == "vertical" and not textline.is_empty():
@@ -656,7 +657,8 @@ def split_textline(table, textline, direction, flag_size=False, strip_text=""):
                         else:
                             # TODO: add test
                             if cut == y_cuts[-1]:
-                                cut_text.append((cut[0] - 1, c, obj))
+                                new_idx = max(cut[0] - 1, 0)
+                                cut_text.append((new_idx, c, obj))
                     elif isinstance(obj, LTAnno):
                         cut_text.append((cut[0], c, obj))
     except IndexError:

ramSeraph avatar Mar 18 '22 11:03 ramSeraph

The fix in that pull request (atlanhq/camelot#475) deals with the problem further down the line. The diff I posted in the comment above is closer to the origin of the problem. Even then, it is still a band-aid.

That whole function is complicated, and the original use case for which the cut[0] + 1 and cut[0] - 1 adjustments were added is not clear to me. So I would rather leave it up to the original author to fix this properly.

ramSeraph avatar Mar 18 '22 11:03 ramSeraph

I personally pinned the camelot version in my project and monkeypatched the fix in.
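
For anyone wanting to try the same approach, here is a hedged sketch of one way to monkeypatch without editing the installed package. It is not equivalent to the diff: it wraps the function and clamps the indices it returns, rather than changing the logic inside it. It assumes `split_textline` takes the table as its first argument and returns a list of `(row_idx, col_idx, text)` tuples, that the table exposes `rows` and `cols`, and that callers look the function up on the `camelot.utils` module at call time. The helper name `clamp_indices` is hypothetical. Verify these assumptions against your pinned camelot version before relying on it.

```python
import functools
from types import SimpleNamespace


def clamp_indices(split_func):
    """Wrap a split_textline-like function and clamp its indices.

    Only the upper bound is clamped. Negative values are passed through
    unchanged, on the assumption that the library may use them as
    sentinels.
    """
    @functools.wraps(split_func)
    def wrapper(table, *args, **kwargs):
        max_r = len(table.rows) - 1
        max_c = len(table.cols) - 1
        return [
            (min(r, max_r), min(c, max_c), text)
            for r, c, text in split_func(table, *args, **kwargs)
        ]
    return wrapper


# Stand-alone demonstration with a fake table (2 rows, 3 columns) and a
# fake splitter that returns one out-of-range column and one sentinel.
fake_table = SimpleNamespace(rows=[None] * 2, cols=[None] * 3)


def fake_split(table, textline):
    return [(0, 3, "overflow"), (-1, -1, "sentinel"), (5, 1, "deep")]


patched = clamp_indices(fake_split)
result = patched(fake_table, "ignored")

# Applying it to camelot (run once, before calling camelot.read_pdf):
# import camelot.utils
# camelot.utils.split_textline = clamp_indices(camelot.utils.split_textline)
```

Pinning the camelot version matters here because the wrapper depends on the function's return shape, which a later release could change.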

ramSeraph avatar Mar 18 '22 11:03 ramSeraph

Awesome, thanks for the code excerpt! I've never edited a package before, but I'll give it a go.

drennapete avatar Mar 19 '22 06:03 drennapete

Looks like I'm having a similar issue with split_text. How do I utilize the fix that you wrote?

motougo avatar Dec 01 '22 20:12 motougo