Crash on 100% RAW Powerlifting PDF

Open sstangl opened this issue 6 years ago • 3 comments

Running Camelot (dev, 2830ed941808c8b514c5be74db1d840b45b26660) against the following PDF produces a crash after about two minutes of 100% CPU processing: https://rawpowerlifting.com/wp-content/uploads/2018/08/2018-Southern-Open-Results.pdf

The specific output is:

[sstangl@mazu camelot]$ python3 -m camelot -f csv -o results.csv lattice 2018-Southern-Open-Results.pdf 
2018-10-26T07:23:06 - INFO - Processing page-1
Traceback (most recent call last):
  File "/usr/lib64/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib64/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/sstangl/dev/camelot/camelot/__main__.py", line 16, in <module>
    main()
  File "/home/sstangl/dev/camelot/camelot/__main__.py", line 12, in main
    cli()
  File "/usr/lib/python3.6/site-packages/click/core.py", line 721, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 1065, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3.6/site-packages/click/core.py", line 894, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/click/decorators.py", line 64, in new_func
    return ctx.invoke(f, obj, *args[1:], **kwargs)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/home/sstangl/dev/camelot/camelot/cli.py", line 105, in lattice
    suppress_warnings=suppress_warnings, **kwargs)
  File "/home/sstangl/dev/camelot/camelot/io.py", line 99, in read_pdf
    tables = p.parse(flavor=flavor, **kwargs)
  File "/home/sstangl/dev/camelot/camelot/handlers.py", line 146, in parse
    t = parser.extract_tables(p)
  File "/home/sstangl/dev/camelot/camelot/parsers/lattice.py", line 364, in extract_tables
    table = self._generate_table(table_idx, cols, rows, v_s=v_s, h_s=h_s)
  File "/home/sstangl/dev/camelot/camelot/parsers/lattice.py", line 304, in _generate_table
    table = table.set_edges(v_s, h_s, joint_close_tol=self.joint_close_tol)
  File "/home/sstangl/dev/camelot/camelot/core.py", line 263, in set_edges
    self.cells[L][J].bottom = True
IndexError: list index out of range

System information:

Linux-4.18.14-200.fc28.x86_64-x86_64-with-fedora-28-Twenty_Eight
Python 3.6.6 (default, Jul 19 2018, 14:25:17) 
[GCC 8.1.1 20180712 (Red Hat 8.1.1-5)]
NumPy 1.14.5
OpenCV 3.4.1

Would you be open to accepting a pull request for a regression suite with actual PDFs? Although the PDFs themselves cannot be included in the repo due to copyright issues, a large number of them are crawled and re-hosted by the Internet Archive. It would be possible to just link to the Internet Archive's version and download from them on build. I'd be interested in helping maintain a regression suite.

sstangl, Oct 26 '18 14:10

@sstangl This looks like a bug, let me look into this.

There's already a files directory inside the tests folder which contains PDFs for the unit tests. Why do you want to add a suite which downloads PDFs from the Internet Archive? That would also increase test running times.

vinayak-mehta, Oct 28 '18 09:10

Why do you want to add a suite which downloads PDFs from the Internet Archive

For regression testing.

I work on a project that handles lots of PDF files, but Tabula fails to convert them properly. I wind up converting them by hand. Instead of throwing away that work, I would like to add the hand-converted data to a test suite, so that Camelot can learn to parse such files correctly and not regress.
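As a rough sketch of what such a regression check could look like: the helper names `load_rows` and `diff_tables` are made up for illustration and are not part of Camelot. The idea is to load the hand-converted CSV, load the CSV that Camelot writes (for example via `-f csv -o results.csv` as in the command above), and report every cell that differs.

```python
import csv
import io


def load_rows(csv_text):
    """Parse CSV text into a list of rows with whitespace-trimmed cells."""
    reader = csv.reader(io.StringIO(csv_text))
    return [[cell.strip() for cell in row] for row in reader]


def diff_tables(expected_rows, actual_rows):
    """Return (row, col, expected, actual) for every mismatching cell.

    Missing rows or cells on either side are reported with None.
    """
    mismatches = []
    for r in range(max(len(expected_rows), len(actual_rows))):
        exp = expected_rows[r] if r < len(expected_rows) else []
        act = actual_rows[r] if r < len(actual_rows) else []
        for c in range(max(len(exp), len(act))):
            e = exp[c] if c < len(exp) else None
            a = act[c] if c < len(act) else None
            if e != a:
                mismatches.append((r, c, e, a))
    return mismatches
```

A test would then assert that `diff_tables(expected, actual)` is empty for each PDF in the suite, so any regression shows up as a list of concrete cell differences rather than a bare failure.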

that would also increase test running times.

All test runners support a cache layer -- these PDFs could be saved to the test cache. A simple script could then check whether the file already exists, and only download it if not. So it would not increase running time, except for the initial download.
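A minimal sketch of such a script, assuming plain standard-library Python. The name `fetch_cached`, its parameters, and the cache directory layout are hypothetical; the `opener` argument exists only so the download step can be swapped out in tests.

```python
import os
import urllib.request


def fetch_cached(url, cache_dir, filename=None, opener=urllib.request.urlopen):
    """Return the local path for url, downloading only on a cache miss."""
    os.makedirs(cache_dir, exist_ok=True)
    # Default to the last path component of the URL as the file name.
    name = filename or url.rstrip("/").rsplit("/", 1)[-1]
    path = os.path.join(cache_dir, name)
    if os.path.exists(path):
        return path  # cache hit: no network access
    # Download to a temporary name first so an interrupted download
    # never leaves a truncated file that looks like a valid cache entry.
    tmp_path = path + ".part"
    with opener(url) as response, open(tmp_path, "wb") as out:
        out.write(response.read())
    os.replace(tmp_path, path)
    return path
```

Each test would call `fetch_cached(url, cache_dir)` with the archived URL of its PDF and a directory that the CI service preserves between runs, so only the first run pays for the download.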

sstangl, Oct 28 '18 17:10

I would like to add the hand-converted data to a test suite, so that Camelot can learn to parse such files correctly and not regress.

Currently, Camelot cannot 'learn' to parse such files correctly, though I see how it can add value in testing.

All test runners support a cache layer -- these PDFs could be saved to the test cache. A simple script could then check whether the file already exists, and only download it if not. So it would not increase running time, except for the initial download.

Camelot uses Travis, and I see that it supports caching directories. As you said, we can add a caching stage to the .travis.yml that uses a script to download new files into the cache directory. You can go ahead and create a PR for this, though we'll have to look at the time taken to run all the tests before merging it into master.

Sorry for the late reply.

vinayak-mehta, Oct 30 '18 14:10