camelot Great library, but dependencies ??!!

Note: This is not an issue, but there is no better place to discuss this.

The stats below are pulled from PyPI downloads. Despite camelot offering a better process than the others, what do you think explains its lower usage?

Aug 13 '19 13:08 akshowhini

Yep, this is a known issue. We need to figure out a way to replace ghostscript and opencv. https://github.com/camelot-dev/camelot/issues/13

Camelot uses only a small subset of code from ghostscript [1] (converting PDF to PNG) and opencv [2] (adaptive thresholding and morphological transformations). The only way I can think of is to re-implement these in Python and have them inside Camelot itself. [2] should be straightforward. Do you have any ideas around [1]? Or any other pointers?

Ghostscript is written in C. I tried looking around in its huge codebase but was totally lost. I'm planning to look into this again by allotting time over the next month; currently the day job takes up a lot of my time. Any pointers would really help!
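
For anyone following along, here is a rough sketch of what the thresholding half of [2] could look like in pure Python. It applies the usual mean-minus-constant rule on a grayscale image stored as a list of rows. The function name, the parameters, and the edge-replication choice are assumptions made for illustration; this is not Camelot's code and is not claimed to match OpenCV's output pixel for pixel.

```python
def adaptive_threshold_mean(gray, block_size=3, c=0):
    """Binarize a grayscale image given as a list of rows of ints (0-255).

    A pixel becomes 255 if it is brighter than (local mean - c), else 0.
    The local mean is taken over a block_size x block_size window centred
    on the pixel; pixels outside the image are replaced by the nearest edge
    pixel (an assumption made for this sketch).
    """
    if block_size < 1 or block_size % 2 == 0:
        raise ValueError("block_size must be a positive odd integer")

    height, width = len(gray), len(gray[0])
    radius = block_size // 2
    window_area = block_size * block_size

    result = []
    for y in range(height):
        row = []
        for x in range(width):
            total = 0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    # Clamp coordinates so border pixels reuse the edge value.
                    yy = min(max(y + dy, 0), height - 1)
                    xx = min(max(x + dx, 0), width - 1)
                    total += gray[yy][xx]
            local_mean = total / window_area
            row.append(255 if gray[y][x] > local_mean - c else 0)
        result.append(row)
    return result
```

The nested loops are slow on full-page renders; a real replacement would likely need a running-sum (integral image) approach, but the rule being applied stays the same.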

Aug 27 '19 14:08 vinayak-mehta

@jnothman Do you have any pointers around this?

Aug 27 '19 14:08 vinayak-mehta

@vinayak-mehta, have you tried pdftoppm (poppler-utils) for converting PDF to PNG?
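
As a minimal sketch of that suggestion, the conversion can be driven from Python by shelling out to pdftoppm. The helper names and the default resolution below are made up for illustration, and the code assumes the pdftoppm binary from poppler-utils is already installed and on PATH.

```python
import subprocess


def build_pdftoppm_command(pdf_path, output_prefix, dpi=300):
    """Return the argument list that renders each page of pdf_path to PNG.

    -png selects PNG output and -r sets the resolution in DPI; pdftoppm
    names the output files after output_prefix.
    """
    return ["pdftoppm", "-png", "-r", str(dpi), pdf_path, output_prefix]


def pdf_to_png(pdf_path, output_prefix, dpi=300):
    """Run pdftoppm; raises CalledProcessError if the conversion fails."""
    subprocess.run(build_pdftoppm_command(pdf_path, output_prefix, dpi), check=True)
```

Keeping the command builder separate from the call that runs it makes the arguments easy to inspect without having poppler installed.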

Aug 27 '19 18:08 satkatai

Yep, I tried it along with imagemagick before landing on ghostscript, since ghostscript gave the best results in terms of image quality.

Aug 27 '19 18:08 vinayak-mehta

OK. In this post, there was one more suggestion to do this with MuPDF.

Aug 27 '19 18:08 satkatai

Hey @vinayak-mehta, not sure where I can help here! The change to pdfbox in #30 has been implemented too, but you need to confirm what kinds of discrepancies are acceptable between backends.

Aug 27 '19 20:08 jnothman

@vinayak-mehta I did not dig into the dependencies and the library much. Based on those numbers, and hoping to help by reducing dependencies and offering a Pro service (to extract tables from images and scanned PDFs) for camelot devs, I worked for https://extracttable.com to develop ~~CamelotPro~~ (taken down because of a naming conflict).

If you think the service as an add-on helps the regular camelot users like me, I would be happy to talk with my team to merge CamelotPro with the open sourced lib

Aug 28 '19 12:08 akshowhini

@akshowhini Camelot is already open source and MIT licensed. It looks like your CamelotPro uses some of the sources from Camelot (I guess to make it compatible), and is GPL 3.0 licensed. It would be nice to mention the original authors somewhere as well.

The SaaS backend you're using is proprietary licensed, but I suppose it also uses Camelot.

I'd be interested to know more about the CamelotPro flavour. For example, does it autodetect whether it should use lattice or stream?

Aug 28 '19 13:08 dimitern

> If you think the service as an add-on helps the regular camelot users like me, I would be happy to talk with my team to merge CamelotPro with the open sourced lib

I think it would make sense to include CamelotPro in Camelot if the former's code is open-sourced (assuming it works well on the current tests and more image-based tests).

> I'd be interested to know more about the CamelotPro flavour. For example, does it autodetect whether it should use lattice or stream?

I'd be interested in learning about the internal workings too!

I tried running the example you've provided in the README but it fails with a KeyError: https://github.com/ExtractTable/camelotpro/issues/1

Aug 28 '19 15:08 vinayak-mehta

@dimitern

Re: credits: I wonder how in the world I missed that. Thanks to all of you for the contributions. I have updated the README as well.

Re: flavor recognition: No, the AI model does not care about lattice or stream. All it was trained to do is detect the tabular structure; consider it a replacement for Nurminen's algorithm.

Aug 28 '19 19:08 akshowhini

> I think it would make sense to include CamelotPro in Camelot if the former's code is open-sourced (assuming it works well on the current tests and more image-based tests).

I do not think camelotpro would pass the camelot-py tests. The main problem it is trying to tackle is extracting tabular structure and characters from images and scanned PDFs.

Aug 28 '19 19:08 akshowhini

> converting PDF to PNG

Hi @vinayak-mehta, have you heard about pdf2image? It is much less painful than Ghostscript. I don't recommend MuPDF, as it is licensed under the AGPL, which is not good for commercial usage.

Sep 14 '20 20:09 luke4u

@luke4u Yes, I've heard about it. I was initially reluctant to replace ghostscript with pdf2image, as users on all platforms would have to install it separately too. Doing this replacement would be difficult because it would break installations for existing users, who would suddenly need poppler-utils instead of ghostscript after upgrading their camelot version. I guess it could be done in a backwards-compatible manner where ghostscript and pdf2image are different backends that camelot can use based on availability.

I'll try to move fast on my goal of not using an external PDF -> PNG conversion tool as a dependency altogether, and make the library self-contained.
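
The availability-based backend idea could be sketched roughly as follows. This is only an illustration: the backend names, the mapping, and the `pick_backend` helper are hypothetical and not part of Camelot. The only external facts relied on are the usual executable names (`gs` on Unix-like systems, `gswin64c`/`gswin32c` on Windows, and `pdftoppm` for poppler-utils).

```python
import shutil

# Hypothetical mapping: backend name -> executables that indicate it is installed.
BACKEND_EXECUTABLES = {
    "ghostscript": ("gs", "gswin64c", "gswin32c"),
    "poppler": ("pdftoppm",),
}


def pick_backend(preferred=("ghostscript", "poppler"), which=shutil.which):
    """Return the first backend in `preferred` whose executable is on PATH.

    `which` is injectable so the lookup can be replaced in tests.
    """
    for name in preferred:
        if any(which(exe) for exe in BACKEND_EXECUTABLES[name]):
            return name
    raise RuntimeError(
        "No PDF -> PNG backend found; install one of: " + ", ".join(preferred)
    )
```

Listing ghostscript first keeps existing installations on the same backend after an upgrade, and poppler is only picked up when ghostscript is missing.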

Sep 14 '20 23:09 vinayak-mehta

Thank you, @vinayak-mehta. Looking forward to your progress!

Sep 16 '20 09:09 luke4u

Trying to install pdftopng and getting this error. Worth using more stable dependencies:

  • Installing pdftopng (0.2.3): Failed

  RuntimeError

  Unable to find installation candidates for pdftopng (0.2.3)

  at ~/Library/Application Support/pypoetry/venv/lib/python3.11/site-packages/poetry/installation/chooser.py:73 in choose_for
       69│
       70│             links.append(link)
       71│
       72│         if not links:
    →  73│             raise RuntimeError(f"Unable to find installation candidates for {package}")
       74│
       75│         # Get the best link
       76│         chosen = max(links, key=lambda link: self._sort_key(package, link))

Dec 13 '23 01:12 HamedMP

I have the same error with poetry add "camelot-py[base]"

Jan 31 '24 07:01 agamm
