camelot IndexError while using split_text

IndexError thrown when using split_text=True

Traceback (most recent call last):
  File "/code/seperate_page.py", line 16, in <module>
    tables = camelot.read_pdf(out_filename,
  File "/usr/local/lib/python3.9/site-packages/camelot/io.py", line 113, in read_pdf
    tables = p.parse(
  File "/usr/local/lib/python3.9/site-packages/camelot/handlers.py", line 176, in parse
    t = parser.extract_tables(
  File "/usr/local/lib/python3.9/site-packages/camelot/parsers/lattice.py", line 431, in extract_tables
    table = self._generate_table(table_idx, cols, rows, v_s=v_s, h_s=h_s)
  File "/usr/local/lib/python3.9/site-packages/camelot/parsers/lattice.py", line 372, in _generate_table
    indices = Lattice._reduce_index(
  File "/usr/local/lib/python3.9/site-packages/camelot/parsers/lattice.py", line 191, in _reduce_index
    if t.cells[r_idx][c_idx].hspan:
IndexError: list index out of range

Steps to reproduce the bug

Code

import camelot

# Lattice extraction with split_text=True raises the IndexError shown above
tables = camelot.read_pdf('service_providers_ul.0.pdf',
                          backend='poppler',
                          pages='1',
                          flavor='lattice',
                          split_text=True)

PDF

service_providers_ul.0.pdf

Screenshots

Not Applicable

Environment

OS: Linux
Python version: 3.9.4
Numpy version: 1.22.1
OpenCV version: 4.5.5.6
Ghostscript version: 0.7
Camelot version: 0.10.1

Additional context

One of the rows contains an empty textline that extends past the edge of the last column. This causes the split_textline code to assign a column index beyond the available column indices, and an exception is thrown further down the line when that index is used.

Mar 07 '22 10:03 ramSeraph

Probably related to #98 #285

Mar 07 '22 10:03 ramSeraph

Cause of Error: https://github.com/camelot-dev/camelot/blob/master/camelot/utils.py#L626

The index shouldn't be incremented unconditionally; the bounds should be checked first.

Mar 07 '22 10:03 ramSeraph

Can confirm that changing that piece of code to check the column index bounds fixed the problem.

Mar 07 '22 10:03 ramSeraph

In general, can we split PDFMiner textline objects on whitespace and use only the non-blank areas when checking for inclusion in a table/column/row/cell?

This might fix most of the errors I have encountered with this library.
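
To make the idea concrete, here is a rough sketch of splitting a textline into its non-blank runs. It does not use PDFMiner or camelot at all: each character is a hypothetical (text, x0, x1) tuple standing in for a PDFMiner character object, and non_blank_spans is an illustrative name, not an existing function in either library.

```python
from typing import List, Tuple

# Hypothetical stand-in for a PDFMiner character: (text, x0, x1).
Char = Tuple[str, float, float]


def _merge(run: List[Char]) -> Tuple[str, float, float]:
    """Join a run of characters into one (text, x0, x1) span."""
    text = "".join(c[0] for c in run)
    return text, run[0][1], run[-1][2]


def non_blank_spans(chars: List[Char]) -> List[Tuple[str, float, float]]:
    """Group consecutive non-whitespace characters, dropping whitespace."""
    spans = []
    current: List[Char] = []
    for ch in chars:
        if ch[0].strip():      # non-blank character: extend the current run
            current.append(ch)
        elif current:          # whitespace closes the current run
            spans.append(_merge(current))
            current = []
    if current:
        spans.append(_merge(current))
    return spans


# Toy textline whose trailing blank runs out to x=120, past a table edge at x=100.
line = [
    ("A", 10.0, 15.0),
    ("B", 15.0, 20.0),
    (" ", 20.0, 60.0),
    ("C", 60.0, 65.0),
    (" ", 65.0, 120.0),
]
spans = non_blank_spans(line)
print(spans)
```

With this approach only the spans for "AB" and "C" would be tested against the column boundaries, so the trailing blank that overshoots the last column would never be assigned an index.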

Mar 07 '22 10:03 ramSeraph

@ramSeraph Can you elaborate on how to implement the fix for this? I think I have the same issue with this doc:

Example 1.pdf

Mar 18 '22 02:03 drennapete

is this the same issue?: https://github.com/atlanhq/camelot/pull/475

Mar 18 '22 02:03 drennapete

@ramSeraph Can you elaborate on how to implement the fix for this? I think I have the same issue with this doc:

Example 1.pdf

It might be the same issue. With the following fix applied, the extraction ran through without the error.

diff --git a/camelot/utils.py b/camelot/utils.py
index 404c00b..e5f2cbc 100644
--- a/camelot/utils.py
+++ b/camelot/utils.py
@@ -623,7 +623,8 @@ def split_textline(table, textline, direction, flag_size=False, strip_text=""):
                         else:
                             # TODO: add test
                             if cut == x_cuts[-1]:
-                                cut_text.append((r, cut[0] + 1, obj))
+                                new_idx = min(cut[0] + 1, len(table.cols) - 1)
+                                cut_text.append((r, new_idx, obj))
                     elif isinstance(obj, LTAnno):
                         cut_text.append((r, cut[0], obj))
         elif direction == "vertical" and not textline.is_empty():
@@ -656,7 +657,8 @@ def split_textline(table, textline, direction, flag_size=False, strip_text=""):
                         else:
                             # TODO: add test
                             if cut == y_cuts[-1]:
-                                cut_text.append((cut[0] - 1, c, obj))
+                                new_idx = max(cut[0] - 1, 0)
+                                cut_text.append((new_idx, c, obj))
                     elif isinstance(obj, LTAnno):
                         cut_text.append((cut[0], c, obj))
     except IndexError:

Mar 18 '22 11:03 ramSeraph

is this the same issue?: atlanhq/camelot#475

The fix in the pull request deals with the problem further down the line. The diff I posted in the above comment is closer to the origin of the problem. Even then, it is still a band-aid.

That whole function is complicated, and the original use case for which the cut[0] + 1 and cut[0] - 1 adjustments were made is not clear to me. So I would rather leave it up to the original author to fix this properly.

Mar 18 '22 11:03 ramSeraph

I personally pinned the camelot version in my project and monkeypatched the fix in.
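
For anyone who has not monkeypatched a package before, a minimal sketch of one way to do it follows. It is not the exact patch described above: instead of editing split_textline, it wraps a split_textline-like function and caps the row and column indices it returns. The names clamp_upper and with_clamped_indices are made up for illustration, and the sketch assumes the wrapped function takes the table as its first argument and returns a list of (row, col, text) tuples, and that the table exposes rows and cols lists. Negative indices are left untouched on purpose, so this only addresses the "index past the last column" case.

```python
import functools
from types import SimpleNamespace


def clamp_upper(indices, n_rows, n_cols):
    """Cap row/col indices at the last valid row/column.

    Negative indices are passed through unchanged (assumption: the
    caller may use them deliberately).
    """
    return [
        (min(r, n_rows - 1), min(c, n_cols - 1), text)
        for r, c, text in indices
    ]


def with_clamped_indices(split_fn):
    """Wrap a split_textline-like function so its results stay in bounds."""
    @functools.wraps(split_fn)
    def wrapper(table, *args, **kwargs):
        result = split_fn(table, *args, **kwargs)
        return clamp_upper(result, len(table.rows), len(table.cols))
    return wrapper


# Toy check: a fake table with 2 rows and 3 columns, and a fake splitter
# that returns a column index one past the last column.
fake_table = SimpleNamespace(rows=[0, 1], cols=[0, 1, 2])


def fake_split(table, textline):
    return [(0, 3, "x"), (-1, 1, "y")]


patched = with_clamped_indices(fake_split)
result = patched(fake_table, None)
print(result)

# Applying it to camelot would look roughly like this (untested assumption:
# the caller inside camelot.utils looks the function up by name at call time):
#
#   import camelot.utils
#   camelot.utils.split_textline = with_clamped_indices(camelot.utils.split_textline)
```

The patch has to run before camelot.read_pdf is called, and pinning the camelot version keeps the assumptions about the function's signature and return value from silently breaking on upgrade.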

Mar 18 '22 11:03 ramSeraph

Awesome, thanks for the code excerpt! I've never edited a package before, but I'll give it a go.

Mar 19 '22 06:03 drennapete

Looks like I'm having a similar issue with split_text. How do I utilize the fix that you wrote?

Dec 01 '22 20:12 motougo
