python-pdfbox Use JPype to call into jars directly

Jul 29 '19 13:07 KOLANICH

Sounds like a good idea. Can you please try the code in the use-jpype branch and let me know if you encounter any issues? The API is almost the same as what is in master. There currently isn't any control over content sent to stdout by the Java library.

Jul 29 '19 19:07 lebedov

IMHO we need a deeper integration. I mean no temporary files, only blobs in memory. No command line arguments, filling the structures directly. Ideally the same capabilities as using pdrbox as a lib from Java, but with all necessary wrappers removing the burden of converting python objects to Java ones (IDK if any of it in this lib, but I had some experience with some apps, dealing with immutable types. It was pain, I had to write some functions which only purpose was patching immutable objects by parsing them into dicts, patching the dicts and then transforming dicts back to immutable objects. Though the result worthed it - the app started to work much faster, I got rid of temporary files and got access to the features not exposed via CLI) from programmer.

Jul 30 '19 07:07 KOLANICH

I put together a quick wrapper for the PDF to image functionality that may be what you are looking for; it returns the extracted pages as RGB numpy arrays. I don't have time to create a full-blown Python interface to the pdfbox Java API, but I can add the above gist to python-pdfbox as a separate function (or perhaps combine it with the jar download code and submit it to camelot as a PR).

Aug 04 '19 19:08 lebedov

I don't have time to create a full-blown Python interface to the pdfbox Java API

For the first time we don't need full-blown, just keep the existing python-pdfbox one, but overcome limitations of CLI interface by changing the way pdfhox is called.

or perhaps combine it with the jar download code

IMHO: it shouldn't download and install jars. Downloading and/or installing jars is either user's burden, or systemwide package manager's (such as apt, portage, brew, nix and conda), or installer's. Not ours. Not camelot's.

Aug 04 '19 20:08 KOLANICH

Since a major design goals of python-pdfbox is enabling users to quickly access pdfbox features regardless of their jar management preferences, I don't wish to remove the automated download feature. Moreover, python-pdfbox permits one to specify the location of the jar file via an environmental variable if one does not want to rely upon the automated download.

Aug 12 '19 19:08 lebedov

I put together a quick wrapper for the PDF to image functionality that may be what you are looking for; it returns the extracted pages as RGB numpy arrays.

Not being aware of this thread, I wrote a similar post and gist independently 😅: https://github.com/pypdfium2-team/pypdfium2/discussions/230 https://gist.github.com/mara004/51c3216a9eabd3dcbc78a86d877a61dc

The gist targets Pillow and else differs slightly, so may still be worth taking a look at.

I don't have time to create a full-blown Python interface to the pdfbox Java API, but I can add the above gist to python-pdfbox as a separate function (or perhaps combine it with the jar download code and submit it to camelot as a PR).

Same problem for me (lack of time), but incorporating the gist code as a new function for direct rendering sounds like a good idea.

Jun 22 '23 13:06 mara004

IMHO: it shouldn't download and install jars. Downloading and/or installing jars is either user's burden, or systemwide package manager's [...], or installer's. Not ours.

Since a major design goals of python-pdfbox is enabling users to quickly access pdfbox features regardless of their jar management preferences, I don't wish to remove the automated download feature.

I see both points, but getting a jar out of the box is indeed much more convenient than having to supply one manually, esp. seeing as it's a library. An app end user shouldn't have to supply any binaries.

However, I'd recommend to move the Jar handling logic to setup stage (e.g. similar to how pypdfium2 bundles binaries on setup). This should not be library runtime code.

Jun 22 '23 13:06 mara004

However, I'd recommend to move the Jar handling logic to setup stage (e.g. similar to how pypdfium2 bundles binaries on setup).

python3 ./setup.py is an obsolete way to install packages. Packages are installed from wheels. And I insist that no jar should be bundled into a wheel. A proper way to handle that is to create a dedicated library which only purpose is to discover jar packages in several OS-specific and distro-specific places, including Maven repos.

Jun 22 '23 16:06 KOLANICH

python3 ./setup.py is an obsolete way to install packages.

@KOLANICH I think there may be a misunderstanding. Calling setup.py directly is obsolete, yes, but setup.py files on their own are not deprecated for anyone needing to run dynamic code on setup, it's just they're supposed to be called only through pip/build/etc. It is merely recommended to use only declarative config where possible (i.e. pure-python projects) to reduce the code execution surface on package installation.

Packages are installed from wheels. And I insist that no jar should be bundled into a wheel.

Or from sdists. Actually it was the main intent behind the wheel format to bundle binaries; for the source alone there's the sdist format. You can pass --no-binary=$PKGNAME to pip if you wish to use an sdist. This distinction could be used for a pdfbox wrapper, e.g. by making it configurable on setup whether to download a Jar or taking a user-supplied one. wheels should do bundle pdfbox IMO.

Jun 22 '23 17:06 mara004

but setup.py files on their own are not deprecated for anyone needing to run dynamic code on setup

You cannot call code on package installation because packages are installed from wheels, not sdists. Again, packages are NOT instelled from sdists. When one "installs" a package from an sdist, pip builds a wheel and installs a package from that wheel. This way every files left in the system by setup.py is just a side effect (that can vanish if package building goes more isolated not in the sense "we install deps every time from scratch" (as the term is misused today), but in the sense "package building process cannot write anywhere except its build dir, and it is enforced by OS (landlock), sandbox (firejail, SandboxIE), interpreter (a task for future) and maybe hypervisor") only". Installation of an sdist is just a workaround for the cases there is no prebuilt wheel for the platform. If anything must be installed from a python package, it should be contained within a wheel.

wheels should do bundle pdfbox IMO.

No, they should not. Bundling anything is bloatware. If someone wants to install pdfbox via pip (a very strange thing to want, you don't install python packages via cargo, npm, go install, composer and so on), there can be a separate wheel containing pdfbox jar only and that python bloatware package name be used as an extra dependency to this package.

Jun 22 '23 18:06 KOLANICH

When one "installs" a package from an sdist, pip builds a wheel and installs a package from that wheel.

That may be correct, but is not relevant here. The main distinction relevant for this context is that installing an sdist runs setup code, while a wheel does not (it's already a frozen file set). What I was thinking of is an environment variable to control pdfbox inclusion on setup time.

If you do not want a bundled pdfbox, you can install via sdist, opt out with that env var and point to a system/caller-provided pdfbox binary instead. That's what a linux distro packager could use, for instance.

This way every files left in the system by setup.py is just a side effect

I would not download pdfbox somewhere into the system but into the source tree and flag it as package data, which is officially supported and not a breakage risk.

I clearly technically disagree with what you're writing and doubt if there's much point in further discussing this with you.

Jun 22 '23 21:06 mara004

FWIW, bundling is also desired by the camelot project:

Just went through python-pdfbox, it automatically downloads and caches the pdfbox jar file which should make installation easier for users, as installing ghostscript has been a pain on Windows. But then again, the users would need Java to use the library.

source: @vinayak-mehta in https://github.com/atlanhq/camelot/pull/346#issuecomment-508947266

Jun 22 '23 21:06 mara004

If you do not want a bundled pdfbox, you can install via sdist, opt out with that env var and point to a system/caller-provided pdfbox binary instead.

IMHO it should be opt-in rather than opt-out. And by default it should use the package from the system.

I would not download pdfbox somewhere into the system but into the source tree and flag it as package data, which is officially supported and not a breakage risk.

This way the talk about using sdists is irrelevant. The wheels having a jar within them should woork fine.

Jun 22 '23 21:06 KOLANICH

IMHO it should be opt-in rather than opt-out. And by default it should use the package from the system.

Hmm, yeah, a smart setup logic like "use system pdfbox if available, download pdfbox otherwise" would be a possibility. However, I still think system pdfbox should only be taken on explicit user pointer.

This way the talk about using sdists is irrelevant. The wheels having a jar within them should woork fine.

I don't think I understand you here? Why should sdists be irrelevant now? Isn't that contradictory to what you wrote above? Do you mean you actually agree with bundling pdfbox in wheels now?

Jun 23 '23 12:06 mara004

a smart setup logic

should be incapsulated into a separate lib because it is needed for pretty any Python wrapper for a Java lib.

Do you mean you actually agree with bundling pdfbox in wheels now?

I don't think we should bundle the jar. Bundling = bloatware.
if one wants to install pdfbox jar via pip, it should be within a wheel. Again, I don't think that installing pdfbox jar via wheels is a right thing to do. But if you still want to achieve it, then it should be within a wheel.
to minimize bloat of the python package, if you still want that, I propose to put it intoma separate package and add it as an optiknal dependency. If someone wants that, it can be installed by specifying the extra.

Jun 23 '23 18:06 KOLANICH

However, I still think system pdfbox should only be taken on explicit user pointer.

I think system pdfbox should be picked automatically and by default. And jars in other wrapper libs should be picked the same way. Automatically and by default. That's why we need a special lib to discover jars in various system-specific places.

Jun 23 '23 18:06 KOLANICH

python-pdfbox python-pdfbox copied to clipboard

Use JPype to call into jars directly

python-pdfbox
python-pdfbox copied to clipboard