Crash on 100% RAW Powerlifting PDF
Running Camelot (dev, 2830ed941808c8b514c5be74db1d840b45b26660) against the following PDF produces a crash after about two minutes of 100% CPU processing: https://rawpowerlifting.com/wp-content/uploads/2018/08/2018-Southern-Open-Results.pdf
The specific output is:
[sstangl@mazu camelot]$ python3 -m camelot -f csv -o results.csv lattice 2018-Southern-Open-Results.pdf
2018-10-26T07:23:06 - INFO - Processing page-1
Traceback (most recent call last):
  File "/usr/lib64/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib64/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/sstangl/dev/camelot/camelot/__main__.py", line 16, in <module>
    main()
  File "/home/sstangl/dev/camelot/camelot/__main__.py", line 12, in main
    cli()
  File "/usr/lib/python3.6/site-packages/click/core.py", line 721, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 1065, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3.6/site-packages/click/core.py", line 894, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/click/decorators.py", line 64, in new_func
    return ctx.invoke(f, obj, *args[1:], **kwargs)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/home/sstangl/dev/camelot/camelot/cli.py", line 105, in lattice
    suppress_warnings=suppress_warnings, **kwargs)
  File "/home/sstangl/dev/camelot/camelot/io.py", line 99, in read_pdf
    tables = p.parse(flavor=flavor, **kwargs)
  File "/home/sstangl/dev/camelot/camelot/handlers.py", line 146, in parse
    t = parser.extract_tables(p)
  File "/home/sstangl/dev/camelot/camelot/parsers/lattice.py", line 364, in extract_tables
    table = self._generate_table(table_idx, cols, rows, v_s=v_s, h_s=h_s)
  File "/home/sstangl/dev/camelot/camelot/parsers/lattice.py", line 304, in _generate_table
    table = table.set_edges(v_s, h_s, joint_close_tol=self.joint_close_tol)
  File "/home/sstangl/dev/camelot/camelot/core.py", line 263, in set_edges
    self.cells[L][J].bottom = True
IndexError: list index out of range
System information:
Linux-4.18.14-200.fc28.x86_64-x86_64-with-fedora-28-Twenty_Eight
Python 3.6.6 (default, Jul 19 2018, 14:25:17)
[GCC 8.1.1 20180712 (Red Hat 8.1.1-5)]
NumPy 1.14.5
OpenCV 3.4.1
Would you be open to accepting a pull request for a regression suite with actual PDFs? Although the PDFs themselves cannot be included in the repo due to copyright issues, a large number of them are crawled and re-hosted by the Internet Archive. It would be possible to just link to the Internet Archive's version and download from them on build. I'd be interested in helping maintain a regression suite.
@sstangl This looks like a bug, let me look into this.
There's already a files directory inside the tests folder that contains the PDFs for the unit tests. Why do you want to add a suite that downloads PDFs from the Internet Archive? That would also increase test running times.
> Why do you want to add a suite which downloads PDFs from internet archive
For regression testing.
I work on a project that handles lots of PDF files, but Tabula fails to convert them properly. I wind up converting them by hand. Instead of throwing away that work, I would like to add the hand-converted data to a test suite, so that Camelot can learn to parse such files correctly and not regress.
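To make the idea concrete, here is a minimal sketch of how hand-converted data could serve as the expected output in a regression check. The helper names `load_expected` and `diff_rows` are hypothetical and not part of Camelot; the sketch only assumes that the extracted table is available as a list of rows.

```python
import csv


def load_expected(csv_path):
    """Read a hand-converted CSV file into a list of rows of stripped strings."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return [[cell.strip() for cell in row] for row in csv.reader(f)]


def diff_rows(actual, expected):
    """Return (row_index, actual_row, expected_row) for every row that differs.

    A row missing on either side is reported with None in its place.
    """
    # Normalize extracted cells to stripped strings so they compare like CSV cells.
    actual = [[str(cell).strip() for cell in row] for row in actual]
    mismatches = []
    for i in range(max(len(actual), len(expected))):
        a = actual[i] if i < len(actual) else None
        e = expected[i] if i < len(expected) else None
        if a != e:
            mismatches.append((i, a, e))
    return mismatches
```

In a test, `actual` could presumably come from something like `camelot.read_pdf(path, flavor="lattice")[0].df.values.tolist()`, and the test would assert that `diff_rows(actual, load_expected(csv_path))` is empty.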
> that would also increase test running times.
All test runners support a cache layer -- these PDFs could be saved to test cache. A simple script could then check for whether the file already exists, and only download it if not. So it would not increase running time, except for the initial download.
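A minimal sketch of such a script, assuming a local cache directory that the CI service preserves between runs. The function name `fetch_cached` and its parameters are hypothetical; the `downloader` argument exists only so the download step can be swapped out in tests.

```python
import os
import urllib.request


def fetch_cached(url, cache_dir, filename=None,
                 downloader=urllib.request.urlretrieve):
    """Return a local path for url, downloading only if it is not cached yet."""
    os.makedirs(cache_dir, exist_ok=True)
    # Default to the last path component of the URL as the cached file name.
    name = filename or url.rstrip("/").rsplit("/", 1)[-1]
    path = os.path.join(cache_dir, name)
    if not os.path.exists(path):
        # Download to a temporary name first, so an interrupted download
        # never leaves a truncated file that looks like a valid cache hit.
        tmp = path + ".part"
        downloader(url, tmp)
        os.replace(tmp, path)
    return path
```

With this shape, only the first run pays for the download; later runs return the cached path without touching the network.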
> I would like to add the hand-converted data to a test suite, so that Camelot can learn to parse such files correctly and not regress.
Currently, Camelot cannot 'learn' to parse such files correctly, though I see how it can add value in testing.
> All test runners support a cache layer -- these PDFs could be saved to test cache. A simple script could then check for whether the file already exists, and only download it if not. So it would not increase running time, except for the initial download.
Camelot uses Travis, and I see that it supports caching directories. As you said, we can add a caching stage to .travis.yml that uses a script to download new files into the cache directory. You can go ahead and create a PR for this, though we'll have to look at the time taken to run all the tests before merging it into master.
Sorry for the late reply.