IndexError while using split_text
IndexError thrown when using split_text=True
Traceback (most recent call last):
File "/code/seperate_page.py", line 16, in <module>
tables = camelot.read_pdf(out_filename,
File "/usr/local/lib/python3.9/site-packages/camelot/io.py", line 113, in read_pdf
tables = p.parse(
File "/usr/local/lib/python3.9/site-packages/camelot/handlers.py", line 176, in parse
t = parser.extract_tables(
File "/usr/local/lib/python3.9/site-packages/camelot/parsers/lattice.py", line 431, in extract_tables
table = self._generate_table(table_idx, cols, rows, v_s=v_s, h_s=h_s)
File "/usr/local/lib/python3.9/site-packages/camelot/parsers/lattice.py", line 372, in _generate_table
indices = Lattice._reduce_index(
File "/usr/local/lib/python3.9/site-packages/camelot/parsers/lattice.py", line 191, in _reduce_index
if t.cells[r_idx][c_idx].hspan:
IndexError: list index out of range
Steps to reproduce the bug
Code
import camelot
tables = camelot.read_pdf('service_providers_ul.0.pdf',
backend='poppler',
pages='1',
flavor='lattice',
split_text=True)
Screenshots
Not Applicable
Environment
- OS: Linux
- Python version: 3.9.4
- Numpy version: 1.22.1
- OpenCV version: 4.5.5.6
- Ghostscript version: 0.7
- Camelot version: 0.10.1
Additional context
One of the rows contains an empty textline that extends past the edge of the last column. This causes the split_textline code to assign a column index beyond the available column indices, and an exception is thrown further down the line when that assigned index is used.
Probably related to #98 #285
Cause of Error: https://github.com/camelot-dev/camelot/blob/master/camelot/utils.py#L626
The index shouldn't be incremented unconditionally there; its bounds should be checked first.
I can confirm that changing that piece of code to check the column index bounds fixed the problem.
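To make the failure mode concrete, here is a simplified model of the column assignment, not camelot's actual code. The function name `assign_column` and its arguments are made up for illustration; only the "increment past the last cut" behaviour mirrors what I described above.

```python
def assign_column(char_mid_x, x_cuts, n_cols, clamp=False):
    """Toy model of assigning a character to a column.

    x_cuts is a list of (column_index, right_edge) pairs, left to right.
    This is a hypothetical helper for illustration, not a camelot function.
    """
    for col_idx, right_edge in x_cuts:
        if char_mid_x <= right_edge:
            return col_idx
    # The character lies to the right of every cut. An unconditional
    # "+ 1" here can step past the last valid column index.
    next_idx = x_cuts[-1][0] + 1
    if clamp:
        # Bounds check: never return an index beyond the last column.
        return min(next_idx, n_cols - 1)
    return next_idx


cuts = [(0, 100.0), (1, 200.0), (2, 300.0)]  # three columns
print(assign_column(350.0, cuts, n_cols=3))              # unclamped
print(assign_column(350.0, cuts, n_cols=3, clamp=True))  # clamped
```

With three columns, the unclamped call yields an index that does not exist in a row of three cells, which is the kind of value that later triggers the IndexError in `_reduce_index`.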
In general: can we split PDFMiner textline objects on whitespace and use only the non-blank areas when checking for inclusion in a table/column/row/cell?
This might fix most of the errors I have encountered with this library.
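A rough sketch of what I mean, assuming each character is available as a `(text, x0, x1)` tuple. That tuple layout and the name `non_blank_runs` are my own simplification for illustration, not the PDFMiner or camelot API.

```python
def non_blank_runs(chars):
    """Split a line into runs of non-whitespace characters.

    chars: list of (text, x0, x1) tuples in reading order (assumed layout).
    Returns a list of (run_text, run_x0, run_x1), one entry per non-blank run.
    """
    runs = []
    current = []
    for ch in chars:
        if ch[0].isspace():
            # Whitespace ends the current run and is itself discarded.
            if current:
                runs.append(current)
                current = []
        else:
            current.append(ch)
    if current:
        runs.append(current)
    return [
        ("".join(c[0] for c in run), run[0][1], run[-1][2])
        for run in runs
    ]


line = [("a", 0.0, 5.0), ("b", 5.0, 10.0), (" ", 10.0, 400.0), ("c", 400.0, 405.0)]
print(non_blank_runs(line))
```

Only the extents of the returned runs would then be tested against the table, so a line consisting entirely of blanks contributes nothing and a wide blank gap cannot spill into a neighbouring cell.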
@ramSeraph Can you elaborate on how to implement the fix for this? I think I have the same issue with this doc:
is this the same issue?: https://github.com/atlanhq/camelot/pull/475
@ramSeraph Can you elaborate on how to implement the fix for this? I think I have the same issue with this doc:
It might be the same issue. With the following fix applied, the run went through:
diff --git a/camelot/utils.py b/camelot/utils.py
index 404c00b..e5f2cbc 100644
--- a/camelot/utils.py
+++ b/camelot/utils.py
@@ -623,7 +623,8 @@ def split_textline(table, textline, direction, flag_size=False, strip_text=""):
else:
# TODO: add test
if cut == x_cuts[-1]:
- cut_text.append((r, cut[0] + 1, obj))
+ new_idx = min(cut[0] + 1, len(table.cols) - 1)
+ cut_text.append((r, new_idx, obj))
elif isinstance(obj, LTAnno):
cut_text.append((r, cut[0], obj))
elif direction == "vertical" and not textline.is_empty():
@@ -656,7 +657,8 @@ def split_textline(table, textline, direction, flag_size=False, strip_text=""):
else:
# TODO: add test
if cut == y_cuts[-1]:
- cut_text.append((cut[0] - 1, c, obj))
+ new_idx = max(cut[0] - 1, 0)
+ cut_text.append((new_idx, c, obj))
elif isinstance(obj, LTAnno):
cut_text.append((cut[0], c, obj))
except IndexError:
is this the same issue?: atlanhq/camelot#475
The fix in the pull request deals with the problem further down the line. The diff I posted in the comment above is closer to the origin of the problem, but even that is still a band-aid.
That whole function is complicated, and the original use case for the cut[0] + 1 and cut[0] - 1 is not clear to me. So I would rather leave it up to the original author to fix this properly.
I personally pinned the camelot version in my project and monkeypatched the fix in.
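For anyone who has not monkeypatched before, one way to do it without editing the installed package is to wrap the function and clamp what it returns. This is a sketch, not exactly the diff above: it assumes split_textline returns a list of `(row_idx, col_idx, text)` tuples and uses `(-1, -1, text)` as a "could not place" marker, which is how I read the source. Check that against the version you pin. The name `clamp_split_indices` is mine.

```python
import functools


def clamp_split_indices(split_fn):
    """Wrap a split_textline-like function so returned indices stay in bounds.

    Assumes split_fn(table, textline, direction, ...) returns a list of
    (row_idx, col_idx, text) tuples, and that table.rows / table.cols are
    sequences. Both are assumptions about the pinned camelot version.
    """

    @functools.wraps(split_fn)
    def wrapper(table, textline, direction, *args, **kwargs):
        result = split_fn(table, textline, direction, *args, **kwargs)
        max_row = len(table.rows) - 1
        max_col = len(table.cols) - 1
        clamped = []
        for r_idx, c_idx, text in result:
            if (r_idx, c_idx) == (-1, -1):
                # Leave the assumed "could not place" marker untouched.
                clamped.append((r_idx, c_idx, text))
            else:
                clamped.append(
                    (min(max(r_idx, 0), max_row), min(max(c_idx, 0), max_col), text)
                )
        return clamped

    return wrapper


# Applying it, once, before calling camelot.read_pdf (left commented out so
# this snippet runs without camelot installed):
#
# import camelot.utils
# camelot.utils.split_textline = clamp_split_indices(camelot.utils.split_textline)
```

This only helps if the caller looks the function up in `camelot.utils` at call time, so pin the camelot version and re-check after any upgrade. Like the diff, it is a band-aid, not a proper fix.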
Awesome, thanks for the code excerpt! I've never edited a package before, but I'll give it a go.
Looks like I'm having a similar issue with split_text. How do I utilize the fix that you wrote?