camelot
Great library, but dependencies ??!!
Note: This is not really an issue, but there is no better place to discuss it.
The stats below are pulled from PyPI downloads. Given that Camelot's approach is better than the others', what do you think explains its lower usage?
Yep, this is a known issue. We need to figure out a way to replace ghostscript and opencv. https://github.com/camelot-dev/camelot/issues/13
Camelot uses only a small subset of code from ghostscript [1] (converting PDF to PNG) and opencv [2] (adaptive thresholding and morphological transformations). The only way I can think of is to re-implement these in Python and have them inside Camelot itself. [2] should be straightforward. Do you have any ideas around [1]? Or any other pointers?
Ghostscript is written in C. I tried looking around in the huge codebase but was totally lost. I'm planning to look into this again by allotting time over the next month; currently the day job takes up a lot of my time. Any pointers would really help!
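For [2], here is a rough idea of what a dependency-free re-implementation could look like. This is only a minimal sketch of mean-based adaptive thresholding in pure Python, not Camelot's actual code: the function name `adaptive_mean_threshold`, its parameters, and the list-of-rows image format are assumptions made for illustration, and the border handling is simplified, so results at the image edges may differ from OpenCV's.

```python
def adaptive_mean_threshold(gray, block_size=3, c=0):
    """Binarize a grayscale image given as a list of rows of ints (0-255).

    A pixel becomes 255 when it is brighter than the mean of its
    block_size x block_size neighbourhood minus c, otherwise 0.
    Neighbourhoods are clipped at the image border (a simplification
    made for this sketch).
    """
    if block_size < 1 or block_size % 2 == 0:
        raise ValueError("block_size must be a positive odd number")
    height, width = len(gray), len(gray[0])
    radius = block_size // 2

    # Integral image: integral[y][x] holds the sum of all pixels in
    # rows 0..y-1 and columns 0..x-1, so any window sum costs O(1).
    integral = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += gray[y][x]
            integral[y + 1][x + 1] = integral[y][x + 1] + row_sum

    result = []
    for y in range(height):
        y0, y1 = max(0, y - radius), min(height, y + radius + 1)
        out_row = []
        for x in range(width):
            x0, x1 = max(0, x - radius), min(width, x + radius + 1)
            window_sum = (integral[y1][x1] - integral[y0][x1]
                          - integral[y1][x0] + integral[y0][x0])
            count = (y1 - y0) * (x1 - x0)
            # pixel > mean - c, rewritten without division
            is_bright = gray[y][x] * count > window_sum - c * count
            out_row.append(255 if is_bright else 0)
        result.append(out_row)
    return result
```

The integral image keeps the cost per pixel constant regardless of `block_size`. A real replacement would still need the morphological transformations, and would have to be checked against the OpenCV output on the existing tests.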
@jnothman Do you have any pointers around this?
@vinayak-mehta, have you tried pdftoppm (poppler-utils) for converting PDF to PNG?
Yep, I tried it along with imagemagick before landing on ghostscript, since ghostscript gave the best results in terms of image quality.
Hey @vinayak-mehta, not sure where I can help here! The change to pdfbox in #30 has been implemented too, but you need to confirm what kinds of discrepancies are acceptable between backends.
@vinayak-mehta I did not dig into the dependencies or the library much. Based on those numbers, and hoping to help by reducing dependencies and offering a Pro service (extracting tables from images and scanned PDFs) for camelot devs, I worked for https://extracttable.com to develop CamelotPro (taken down because of a naming conflict).
If you think the service as an add-on helps the regular camelot users like me, I would be happy to talk with my team to merge CamelotPro with the open sourced lib
@akshowhini Camelot already is open source and MIT licensed. It looks like your CamelotPro uses some of the sources from Camelot (I guess to make it compatible), and is GPL 3.0 licensed. It will be nice to mention the original authors somewhere as well.
The SaaS backend you're using is proprietary licensed, but I suppose it also uses Camelot.
I'd be interested to know more about the CamelotPro flavour, for example - does it autodetect whether it should use lattice or stream ?
> If you think the service as an add-on helps the regular camelot users like me, I would be happy to talk with my team to merge CamelotPro with the open sourced lib
I think it would make sense to include CamelotPro in Camelot if former's code is open-sourced. (assuming it works well on the current tests and more image-based tests)
> I'd be interested to know more about the CamelotPro flavour, for example - does it autodetect whether it should use lattice or stream ?
I'd be interested in learning about the internal workings too!
I tried running the example you've provided in the README but it fails with a KeyError: https://github.com/ExtractTable/camelotpro/issues/1
@dimitern
Re: credits: I wonder how in the world I missed that. Thankful to all of you for the contributions. I have updated the README as well.
Re: flavor recognition: No, the AI model does not care about lattice or stream. All it was trained to do is detect the tabular structure; consider it a replacement for Nurminen's algorithm.
> I think it would make sense to include CamelotPro in Camelot if former's code is open-sourced. (assuming it works well on the current tests and more image-based tests)
I do not think camelotpro would pass the tests of camelot-py. The main problem it tries to tackle is extracting tabular structure and characters from images and scanned PDFs.
> converting PDF to PNG
Hi @vinayak-mehta, have you heard about pdf2image? It is much less painful than Ghostscript. I don't recommend mupdf, as it is licensed under AGPL, which is not good for commercial usage.
@luke4u Yes, I've heard about it. I was initially reluctant to replace ghostscript with pdf2image, as users on all platforms would have to install it separately too. Doing this replacement would be difficult, as it would break installations for existing users when they suddenly need poppler-utils instead of ghostscript after upgrading their camelot version. I guess it could be done in a backwards-compatible manner where ghostscript and pdf2image are different backends that camelot can use based on availability.
I'll try to move fast on my goal of not using an external PDF -> PNG conversion tool as a dependency altogether, and make the library self-contained.
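To make the "different backends based on availability" idea concrete, here is a minimal sketch, not an existing Camelot API. `pick_backend`, `module_available`, and `BACKEND_MODULES` are hypothetical names; the only assumption about the real packages is that they are imported as `ghostscript` and `pdf2image`. Note that an importable Python package does not guarantee the underlying Ghostscript or poppler binaries are installed.

```python
import importlib.util

# Hypothetical mapping from backend name to the Python module it imports.
BACKEND_MODULES = {
    "ghostscript": "ghostscript",
    "pdf2image": "pdf2image",
}


def module_available(module_name):
    """Return True if the module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


def pick_backend(preferred=("ghostscript", "pdf2image"),
                 is_available=module_available):
    """Return the first backend in `preferred` whose module is available.

    ghostscript comes first so that existing installations keep their
    current behaviour after an upgrade.
    """
    for name in preferred:
        if is_available(BACKEND_MODULES[name]):
            return name
    raise ImportError(
        "No PDF -> PNG backend found; install one of: " + ", ".join(preferred)
    )
```

Calling `pick_backend()` with no arguments checks the current environment. The `is_available` parameter exists only so the selection logic can be exercised without either package installed.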
Thank you @vinayak-mehta. Looking forward to your progress!
I am trying to install pdftopng and getting this error. It would be worth using more stable dependencies:
• Installing pdftopng (0.2.3): Failed
RuntimeError
Unable to find installation candidates for pdftopng (0.2.3)
at ~/Library/Application Support/pypoetry/venv/lib/python3.11/site-packages/poetry/installation/chooser.py:73 in choose_for
69│
70│ links.append(link)
71│
72│ if not links:
→ 73│ raise RuntimeError(f"Unable to find installation candidates for {package}")
74│
75│ # Get the best link
76│ chosen = max(links, key=lambda link: self._sort_key(package, link))
I have the same error with poetry add "camelot-py[base]"