python-pdfbox

Use JPype to call into jars directly

Open KOLANICH opened this issue 4 years ago • 16 comments

KOLANICH avatar Jul 29 '19 13:07 KOLANICH

Sounds like a good idea. Can you please try the code in the use-jpype branch and let me know if you encounter any issues? The API is almost the same as what is in master. There currently isn't any control over content sent to stdout by the Java library.

lebedov avatar Jul 29 '19 19:07 lebedov

IMHO we need deeper integration: no temporary files, only in-memory blobs; no command-line arguments, the structures filled in directly. Ideally it would offer the same capabilities as using pdfbox as a library from Java, but with all the wrappers needed to take the burden of converting Python objects to Java ones off the programmer. (IDK if any of that applies to this lib, but I have some experience with apps that deal with immutable types. It was a pain: I had to write functions whose only purpose was to patch immutable objects by parsing them into dicts, patching the dicts, and then transforming the dicts back into immutable objects. The result was worth it, though: the app became much faster, I got rid of the temporary files, and I got access to features not exposed via the CLI.)
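A minimal sketch of what such a direct, in-memory call could look like with JPype. It assumes PDFBox 2.x, where `PDDocument.load(byte[])` exists (3.x moved loading to `Loader.loadPDF`), and a JPype version that accepts `bytes` when building a Java `byte[]`. The function names `is_pdf` and `extract_text` and the `jar_path` parameter are hypothetical and not part of python-pdfbox.

```python
def is_pdf(data: bytes) -> bool:
    """Cheap sanity check before crossing into the JVM: PDF files normally begin with '%PDF-'."""
    return data.startswith(b"%PDF-")


def extract_text(data: bytes, jar_path: str) -> str:
    """Extract text from an in-memory PDF by calling PDFBox classes directly.

    No temporary files and no command-line arguments are involved.
    """
    if not is_pdf(data):
        raise ValueError("data does not start with a PDF header")

    # Imported lazily so this module can be imported without JPype installed.
    import jpype

    # A JVM can only be started once per process, so reuse it if it is running.
    if not jpype.isJVMStarted():
        jpype.startJVM(classpath=[jar_path])

    PDDocument = jpype.JClass("org.apache.pdfbox.pdmodel.PDDocument")
    PDFTextStripper = jpype.JClass("org.apache.pdfbox.text.PDFTextStripper")

    # Assumption: PDFBox 2.x API. Convert the Python bytes to a Java byte[].
    java_bytes = jpype.JArray(jpype.JByte)(data)
    document = PDDocument.load(java_bytes)
    try:
        # getText returns a java.lang.String; convert it to a Python str.
        return str(PDFTextStripper().getText(document))
    finally:
        document.close()
```

Something like `extract_text(open("report.pdf", "rb").read(), "/path/to/pdfbox-app.jar")` would then return the text without touching the CLI; both paths here are placeholders.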

KOLANICH avatar Jul 30 '19 07:07 KOLANICH

I put together a quick wrapper for the PDF to image functionality that may be what you are looking for; it returns the extracted pages as RGB numpy arrays. I don't have time to create a full-blown Python interface to the pdfbox Java API, but I can add the above gist to python-pdfbox as a separate function (or perhaps combine it with the jar download code and submit it to camelot as a PR).

lebedov avatar Aug 04 '19 19:08 lebedov

> I don't have time to create a full-blown Python interface to the pdfbox Java API

For a start we don't need a full-blown one; just keep the existing python-pdfbox interface, but overcome the limitations of the CLI by changing the way pdfbox is called.

> or perhaps combine it with the jar download code

IMHO it shouldn't download and install jars. Downloading and/or installing jars is the job of the user, of a system-wide package manager (such as apt, portage, brew, nix, or conda), or of an installer. Not ours. Not camelot's.

KOLANICH avatar Aug 04 '19 20:08 KOLANICH

Since a major design goal of python-pdfbox is to let users quickly access pdfbox features regardless of their jar management preferences, I don't wish to remove the automated download feature. Moreover, python-pdfbox lets one specify the location of the jar file via an environment variable if one does not want to rely on the automated download.
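As a rough illustration of how the two positions can coexist, the lookup could try an explicit environment variable first, then system-wide locations, and only report "not found" so that the caller decides whether to download. This is a sketch, not python-pdfbox's actual code: the variable name `PDFBOX_JAR`, the function `find_pdfbox_jar`, the search directories, and the `pdfbox-app*.jar` file pattern are all assumptions made for the example.

```python
import os
from pathlib import Path
from typing import Iterable, Mapping, Optional


def find_pdfbox_jar(
    env: Mapping[str, str] = os.environ,
    search_dirs: Iterable[str] = ("/usr/share/java", "/usr/local/share/java"),
    env_var: str = "PDFBOX_JAR",  # hypothetical variable name
) -> Optional[Path]:
    """Locate a PDFBox jar without downloading anything.

    Returns None when nothing is found, leaving the fallback
    (for example an automated download) to the caller.
    """
    # 1. An explicit user setting always wins, and a wrong value is an error.
    explicit = env.get(env_var)
    if explicit:
        path = Path(explicit)
        if not path.is_file():
            raise FileNotFoundError(f"{env_var} points to a missing file: {path}")
        return path

    # 2. Otherwise look where a system package manager might have put the jar.
    for directory in search_dirs:
        root = Path(directory)
        if not root.is_dir():
            continue
        candidates = sorted(root.glob("pdfbox-app*.jar"))
        if candidates:
            return candidates[0]

    # 3. Nothing found: the caller chooses between failing and downloading.
    return None
```

With this split, users who manage jars themselves never trigger a download, while the default experience can stay automatic.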

lebedov avatar Aug 12 '19 19:08 lebedov