Prebuilt binaries
(I'm aware that #16 already exists; I thought it would be nice to lay out a few reasons in an organized fashion.)
This PDF library is, in my experience, the best in the business. PDFMiner, with all due respect, is slow, inaccurate, and inconsistent, which makes it impossible to use reliably in some cases. Other XPDF/Poppler bindings are outdated and abandoned. Other workarounds (such as those mentioned in #16) are plagued with some of the same issues (mainly inaccuracy).
This is where pdftotext comes in handy. It's fast and gives accurate results. The only problem is that there's a pretty high barrier to using this package. Developers must install a few packages on a Linux system before this package can be built and installed. Windows users, on the other hand, are left with no clue how to install it. This could all be mitigated with prebuilt binaries for Windows, and for other platforms as well.
Glad you like the library!
I agree that it would be nice to have wheels built automatically for all platforms, but the tools in this space are lacking:
- cibuildwheel is in beta, but it seems stable enough for building wheels. Because we require shared libraries here, though, we also need to bundle those, so:
- auditwheel can help with the linux bundling, but it doesn't seem to get much use, and its build is currently failing
- delocate can help with the macOS bundling, but it identifies itself as alpha quality without much use in the wild
- there doesn't seem to be any tool out there to help with the bundling on Windows, so that would have to be done manually anyway. I don't own any Windows system to troubleshoot on
- should we just bundle poppler itself, or all its dependencies (libtiff, libpng, fontconfig, freetype, ...) too?
- what versions of libc and libstdc++ should we build with for linux? To get good compatibility across lots of distros, we have to build with something as old as CentOS 5. But then that means we have to use an older version of poppler, complete with all its bugs
As long as the Python packaging community doesn't really address this issue (the relevant documentation hasn't been updated for over five years!) and the tooling is in a sad state, I would rather keep the status quo. It requires one extra step by the user, but it is simple and reliable and should work on any unix-like system that has poppler.
I didn't realize that this was such a broad issue. Thank you for sharing your position on this. Unfortunately I'm not very experienced when it comes to packaging binary extensions/building them, so thus far I've been unable to build this on Windows.
Hi. I've got a problem. This is the only package that works for my type of PDFs.
I have developer rights on my PC, so I installed python3-dev, poppler/dev and so on, then installed pdftotext into my virtual environment, and it runs without a problem.
But when my colleague opens my shared folder from a different PC (he doesn't have dev rights) and activates the environment, import pdftotext fails because it can't find libpoppler-cpp.so...
Can it be copied somewhere? Or... How would I provide this script to Linux colleagues?
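One way to narrow this down is to check, on the colleague's machine, whether the dynamic loader can locate the poppler C++ library at all. The following is a minimal diagnostic sketch, not part of pdftotext: the helper names are made up for illustration, and it assumes the failure comes from poppler not being installed system-wide on that PC (a virtual environment carries the compiled Python extension, but presumably not the shared library it links against).

```python
import ctypes.util
from typing import Optional


def find_poppler_cpp() -> Optional[str]:
    """Return the library name the system resolves for poppler-cpp, or None.

    Hypothetical helper for illustration; it only wraps the standard
    library's ctypes.util.find_library.
    """
    return ctypes.util.find_library("poppler-cpp")


def describe_poppler_cpp() -> str:
    """Produce a one-line, human-readable diagnosis."""
    found = find_poppler_cpp()
    if found is None:
        return ("libpoppler-cpp was not found on this machine; "
                "'import pdftotext' is likely to fail here")
    return f"libpoppler-cpp resolves to {found}"


if __name__ == "__main__":
    print(describe_poppler_cpp())
```

If this reports that the library is missing, copying the virtual environment alone will probably not be enough; poppler itself would need to be available on that machine.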
> I didn't realize that this was such a broad issue. Thank you for sharing your position on this. Unfortunately I'm not very experienced when it comes to packaging binary extensions/building them, so thus far I've been unable to build this on Windows.
I'm currently unable to build it on Windows. Did you ever figure it out? I guess building Poppler, really, is the problem, which seems to work for these guys.
If you need to use this library on Windows, right now you can
- use conda to install poppler as described in the README, or
- build poppler yourself and figure out how to get it all working. See #72 for one example
I don't have or use Windows, so even getting it working in conda was a challenge!
The link you provided is for tsdgeos/poppler_mirror. tsdgeos is the maintainer of poppler, so of course he knows how to build it on Windows! :smile: This issue is more about the difficulty of packing and distributing, and since I don't have Windows, well...
I'm just doing it on Linux. It's a WSL world.
> there doesn't seem to be any tool out there to help with the bundling on Windows, so that would have to be done manually anyway.
That used to be true but now there is https://github.com/adang1345/delvewheel 🥳
delvewheel is a command-line tool for creating Python wheel packages for Windows that have DLL dependencies that may not be present on the target system. It is functionally similar to auditwheel (for Linux) and delocate (for Mac OS).
It should be relatively easy to integrate into cibuildwheel using something like CIBW_REPAIR_WHEEL_COMMAND=delvewheel repair.
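To make that concrete, here is a rough sketch of how the three repair tools mentioned in this thread could be wired into cibuildwheel through its per-platform repair-command environment variables. The function names are hypothetical, and the exact flags are an assumption that should be checked against each tool's documentation before relying on them; `{wheel}` and `{dest_dir}` are placeholders that cibuildwheel substitutes itself.

```python
def repair_commands() -> dict:
    """Map cibuildwheel's per-platform repair variables to a bundling tool.

    Illustrative only: one repair tool per platform, as discussed above.
    """
    return {
        # Linux: auditwheel copies external shared libraries into the wheel.
        "CIBW_REPAIR_WHEEL_COMMAND_LINUX":
            "auditwheel repair -w {dest_dir} {wheel}",
        # macOS: delocate does the equivalent for dylibs.
        "CIBW_REPAIR_WHEEL_COMMAND_MACOS":
            "delocate-wheel -w {dest_dir} {wheel}",
        # Windows: delvewheel bundles the DLL dependencies.
        "CIBW_REPAIR_WHEEL_COMMAND_WINDOWS":
            "delvewheel repair -w {dest_dir} {wheel}",
    }


def as_export_lines(commands: dict) -> list:
    """Render the mapping as shell 'export' lines for a CI script."""
    return [f'export {name}="{value}"' for name, value in sorted(commands.items())]


if __name__ == "__main__":
    for line in as_export_lines(repair_commands()):
        print(line)
```

Keeping the mapping in one place makes it easy to see that each platform needs its own bundling step, which is the core of the difficulty described earlier in this issue.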
I'm going to look into this if no one else is doing so already.
> should we just bundle poppler itself, or all its dependencies (libtiff, libpng, fontconfig, freetype, ...) too?
Generally speaking, a binary wheel should contain all dependencies that would not be found on a vanilla system.
> what versions of libc and libstdc++ should we build with for linux?
There are three different compatibility sets (and corresponding OS images): manylinux1, manylinux2010 and manylinux2014. If possible, manylinux1 (the oldest) should be targeted.
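For reference, each of these tags corresponds to a minimum glibc version: 2.5 for manylinux1 (the CentOS 5 baseline), 2.12 for manylinux2010, and 2.17 for manylinux2014. The sketch below, with hypothetical function names, shows which of the three tags a system with a given glibc could accept. It considers glibc only and ignores the other requirements the manylinux policies impose (for example libstdc++ symbol versions).

```python
import platform
from typing import Optional, Tuple

# Minimum glibc version for each compatibility set, oldest first.
MANYLINUX_GLIBC = {
    "manylinux1": (2, 5),
    "manylinux2010": (2, 12),
    "manylinux2014": (2, 17),
}


def compatible_tags(glibc_version: Tuple[int, int]) -> list:
    """Return the manylinux tags usable on a system with this glibc."""
    return [tag for tag, minimum in MANYLINUX_GLIBC.items()
            if glibc_version >= minimum]


def local_glibc() -> Optional[Tuple[int, int]]:
    """Best-effort detection of the running interpreter's glibc version."""
    name, version = platform.libc_ver()
    parts = version.split(".")
    if name != "glibc" or len(parts) < 2:
        return None  # not glibc (e.g. musl, macOS, Windows) or unknown
    return int(parts[0]), int(parts[1])
```

A wheel built for the oldest tag is accepted by the widest range of systems, which is why targeting manylinux1 is attractive when the build allows it.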
> But then that means we have to use an older version of poppler, complete with all its bugs
Not necessarily. It should be possible to compile poppler against older libc versions.
In general, I agree with your assessment of the wheel packaging toolchain. It really is a mess.
Okay, so I managed to get Windows builds working. This actually revealed an issue in delvewheel but @adang1345 released a fixed version immediately!
I will send in a PR shortly.
% uname -sKU
FreeBSD 1400053 1400053
% which pdftotext
/usr/local/bin/pdftotext
% pkg provides /usr/local/bin/pdftotext
Name : poppler-utils-21.12.0
Desc : Poppler's xpdf-workalike command line utilities
Repo : FreeBSD
Filename: usr/local/bin/pdftotext
% pkg info --list textproc/py-pdftotext
py38-pdftotext-2.2.2:
/usr/local/lib/python3.8/site-packages/pdftotext-2.2.2-py3.8.egg-info/PKG-INFO
/usr/local/lib/python3.8/site-packages/pdftotext-2.2.2-py3.8.egg-info/SOURCES.txt
/usr/local/lib/python3.8/site-packages/pdftotext-2.2.2-py3.8.egg-info/dependency_links.txt
/usr/local/lib/python3.8/site-packages/pdftotext-2.2.2-py3.8.egg-info/top_level.txt
/usr/local/lib/python3.8/site-packages/pdftotext.cpython-38.so
/usr/local/share/licenses/py38-pdftotext-2.2.2/LICENSE
/usr/local/share/licenses/py38-pdftotext-2.2.2/MIT
/usr/local/share/licenses/py38-pdftotext-2.2.2/catalog.mk
%
https://www.freshports.org/graphics/poppler-utils/
https://www.freshports.org/textproc/py-pdftotext/