camelot [parsers.stream] - Fix IndexError when extracting more tables than there are columns

Open JosePVB opened this issue 5 years ago • 4 comments

The changes in this PR are based on the conversation in https://github.com/atlanhq/camelot/issues/357, but they do not address the enhancement tracked in the open issue that came out of that conversation: https://github.com/camelot-dev/camelot/issues/50

This PR addresses the difficulty of using the Stream parser to extract tables from a PDF when the table structures of interest are known to the caller and consistent, but the tables vary in height, starting position, or number, especially in an automated or programmatic context.
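
To make the use case concrete, here is a minimal caller-side sketch, not the change made in this PR. It assumes the caller has a single known column layout and a variable number of table areas, and builds a `columns` list with one entry per area. The helper name `expand_columns`, the coordinates, and the file name are hypothetical; `flavor`, `table_areas`, and `columns` are existing `camelot.read_pdf` parameters.

```python
def expand_columns(columns, n_tables):
    """Return one column spec per table, reusing the last spec if the list is short."""
    if not columns:
        raise ValueError("at least one column spec is required")
    if n_tables <= len(columns):
        # More specs than tables: keep only as many as needed.
        return list(columns[:n_tables])
    # Fewer specs than tables: repeat the last known layout.
    return list(columns) + [columns[-1]] * (n_tables - len(columns))


# Hypothetical areas ("x1,y1,x2,y2") and column x-coordinates for illustration.
table_areas = ["50,700,550,500", "50,450,550,250", "50,200,550,50"]
columns = expand_columns(["120,300,420"], len(table_areas))

# The call itself is left commented out so the sketch runs without a PDF:
# import camelot
# tables = camelot.read_pdf("report.pdf", flavor="stream",
#                           table_areas=table_areas, columns=columns)
```

This only helps when the number of tables is known before parsing; when the parser detects the tables itself, the caller cannot size the list in advance, which is the situation this PR is concerned with.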

Feb 01 '20 01:02 JosePVB

@vinayak-mehta Please provide guidance on what I should do about the failed checks. For DeepSource, two of the issues are not created by this PR and the other two are due to my decision to keep internal consistency within the files that I added to.

With regard to the failed Travis CI build, existing tests on the most recent upstream/master branch were already failing for me locally before I made any modifications. Here is my development environment:

>>> import platform; print(platform.platform())
Linux-4.15.0-74-generic-x86_64-with-Ubuntu-18.04-bionic
>>> import sys; print('Python', sys.version)
Python 3.6.9 (default, Nov  7 2019, 10:44:02) 
[GCC 8.3.0]
>>> import numpy; print('NumPy', numpy.__version__)
NumPy 1.18.1
>>> import cv2; print('OpenCV', cv2.__version__)
OpenCV 4.1.2
>>> import camelot; print('Camelot', camelot.__version__)
Camelot 0.7.3

The test added in this PR passed in every Travis CI build except the Python 2.7 one. I do not see how the test failures in the other Python builds could be a result of the changes in this PR. Therefore, I would highly appreciate your guidance on these matters.

Besides those two pending issues, I feel that the changes are ready for review.

Feb 01 '20 02:02 JosePVB

Thanks for the PR! I'll review this tomorrow.

Mar 21 '20 13:03 vinayak-mehta

Hey!

As camelot is dead, we are trying to build a maintained fork at pypdf_table_extraction.

Do you want to open the PR against that fork so that we can merge your improvement?

Feb 25 '24 11:02 MartinThoma

Getting this issue fixed would be huge. I will move to pypdf_table_extraction if this works out.

Feb 28 '24 06:02 bmax
