Command line version of convert_from_path so you can call pdf2image directly inside a Docker image
Describe the bug
Your module is a great help when working with PDFs and pulling images out of them. I just wrote a script that runs convert_from_path from the command line instead of importing it into an existing Python script. Why, you might ask? Here's my current scenario.
I am running into issues getting the latest version of Poppler installed locally. Most recently I have been working with an AWS AMI running Amazon Linux 2, which is essentially CentOS 7. According to https://pkgs.org/download/poppler-utils, 0.26.5 (Sep 2014) is the last version available for CentOS 7, 0.66.0 (June 2018) for CentOS 8, and 20.11.0 (Nov 2020) for CentOS 8 Stream. The latest release is 21.03 (March 2021).
I tried, unsuccessfully, to build my own versions of the libraries by following a bunch of http://www.linuxfromscratch.org articles and installing a lot of prereqs. The biggest issue is that the version I build is not the one being used; the version installed via yum is picked up instead, so the version dependencies I have been trying to satisfy are not recognized. I don't want to mess with yum and screw everything else up.
So I've gone down the path of Docker...one of those things that I knew I should have learned but never got around to. It is the perfect solution.
The last part is that I need a way to issue a command to the Poppler Docker container that will run pdftoppm or, better yet, convert_from_path. Not sure how familiar you are with Docker, but you can run a command inside a running container from an interactive terminal with docker exec -it poppler <script>. It would be amazing if you could:
docker exec -it poppler convert_from_path.py /tmp/abc.pdf --thread_count 4 --size None 2400 --output_folder /tmp/done
You can mount the host /tmp folder as a volume so that the Docker container can read from and write to it.
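To make the request concrete, here is a minimal sketch of what such a convert_from_path.py could look like. It is not the script from the upcoming PR: the flag names are taken from the example command above, while the helper names (size_value, normalize_size, build_parser) and the way "None" is parsed for --size are my own assumptions for illustration.

```python
#!/usr/bin/env python3
"""Sketch of a command-line wrapper around pdf2image.convert_from_path."""
import argparse


def size_value(text):
    """Parse one --size component: the literal 'None' or an integer."""
    return None if text == "None" else int(text)


def normalize_size(values):
    """Turn the parsed --size list into what convert_from_path accepts."""
    if values is None:
        return None
    if len(values) == 1:
        return values[0]          # a single int
    return tuple(values)          # e.g. (None, 2400)


def build_parser():
    parser = argparse.ArgumentParser(
        description="Run pdf2image.convert_from_path from the command line."
    )
    parser.add_argument("pdf_path", help="path to the PDF to convert")
    parser.add_argument("--thread_count", type=int, default=1)
    parser.add_argument("--size", nargs="+", type=size_value, default=None)
    parser.add_argument("--output_folder", default=None)
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    # Imported here so that argument parsing works even without pdf2image.
    from pdf2image import convert_from_path

    pages = convert_from_path(
        args.pdf_path,
        thread_count=args.thread_count,
        size=normalize_size(args.size),
        output_folder=args.output_folder,
    )
    print(f"Converted {len(pages)} page(s)")
    return 0
```

To use it as a script, add the usual `if __name__ == "__main__": raise SystemExit(main())` at the bottom and make the file executable inside the container. Only a few of convert_from_path's keyword arguments are exposed here; the rest could be added the same way.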
I can now leave all of my existing code as is and change just the line that used to call pdf2image.convert_from_path() so that it calls convert_from_path.py in the Docker container (via the docker module). Everything else stays exactly the same. All of my environments can run the same version of Poppler, and I have no more issues with which host OS I am using.
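As a rough sketch of that one-line swap, the host-side call could look something like this with the Docker SDK for Python. The container name "poppler" and the flags come from the example command; build_command and convert_in_container are hypothetical helper names, and I am assuming convert_from_path.py is on the PATH inside the container.

```python
def build_command(pdf_path, output_folder, thread_count=1, size=None):
    """Build the argv list for the (assumed) convert_from_path.py CLI."""
    cmd = [
        "convert_from_path.py",
        pdf_path,
        "--thread_count", str(thread_count),
        "--output_folder", output_folder,
    ]
    if size is not None:
        values = size if isinstance(size, (tuple, list)) else (size,)
        # str(None) gives "None", matching the CLI example.
        cmd += ["--size", *[str(v) for v in values]]
    return cmd


def convert_in_container(pdf_path, output_folder, container_name="poppler",
                         **options):
    """Run the conversion inside an already running container."""
    import docker  # Docker SDK for Python, imported lazily

    client = docker.from_env()
    container = client.containers.get(container_name)
    exit_code, output = container.exec_run(
        build_command(pdf_path, output_folder, **options)
    )
    if exit_code != 0:
        raise RuntimeError(output.decode(errors="replace"))
    return output.decode(errors="replace")
```

A call such as `convert_in_container("/tmp/abc.pdf", "/tmp/done", thread_count=4, size=(None, 2400))` would then replace the direct pdf2image call, provided both paths live under the folder mounted into the container.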
Best of all, you could host Docker image repos that contain multiple popular versions of Poppler. They are dead easy to set up, and I am more than happy to work with you on doing that. Right now, I have a private image with the latest version of Poppler. Check https://stackoverflow.com/questions/61272431/installing-poppler-utils-of-version-0-82-in-docker#63265495 for the Dockerfile I used as a template.
This will not work on Windows. Unix only. Same with the containers, they are Unix only. Although someone could make a Windows version if they were so inclined.
Desktop (please complete the following information):
- OS: Linux
A PR with the script is coming soon.