pdf2image
Page number duplicated in multi-page PDFs
Describe the bug
Given a multi-page PDF, the page number is encoded twice in the output file name: once by pdf2image and again by pdftoppm/pdftocairo.
To Reproduce
Steps to reproduce the behavior:
(1) Download multipage.pdf
(2) Run this code from the same directory as multipage.pdf:
import pathlib
from pdf2image import convert_from_path
pdf_file = pathlib.Path(r"./multipage.pdf")
convert_from_path(pdf_file, output_folder=".", output_file=pdf_file.stem, fmt='jpeg')
(3) The previous step should produce 10 JPG files. Notice that each filename follows the format: {PPM-root}{PPPP}-{number}.jpg
Expected behavior
Filenames should encode the page number only once (which the pdfto* tools already handle): {PPM-root}-{number}.jpg
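To make the two formats concrete, here is a small sketch that tells them apart with regular expressions. The helper names `has_duplicate_counter` and `has_expected_name` are made up for illustration and are not part of pdf2image; the patterns assume a 4-digit counter and a `.jpg` extension, as in the outputs reported in this issue.

```python
import re


def has_duplicate_counter(filename, root):
    """True if the name looks like {root}{PPPP}-{number}.jpg (the reported format)."""
    pattern = re.escape(root) + r"\d{4}-\d+\.jpg"
    return re.fullmatch(pattern, filename) is not None


def has_expected_name(filename, root):
    """True if the name looks like {root}-{number}.jpg (the expected format)."""
    pattern = re.escape(root) + r"-\d+\.jpg"
    return re.fullmatch(pattern, filename) is not None


if __name__ == "__main__":
    for name in ("multipage0001-01.jpg", "multipage-01.jpg"):
        print(name, has_duplicate_counter(name, "multipage"), has_expected_name(name, "multipage"))
```

A check like this could be dropped into a test that runs over the output folder after a conversion.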
Screenshots
File tree showing outputs for pdf2image, pdftoppm, and pdftocairo:
│ driver.py
│ multipage.pdf
│
├───output_pdf2image
│ multipage0001-01.jpg
│ multipage0001-02.jpg
│ multipage0001-03.jpg
│ multipage0001-04.jpg
│ multipage0001-05.jpg
│ multipage0001-06.jpg
│ multipage0001-07.jpg
│ multipage0001-08.jpg
│ multipage0001-09.jpg
│ multipage0001-10.jpg
│
├───output_pdftocairo
│ multipage-01.jpg
│ multipage-02.jpg
│ multipage-03.jpg
│ multipage-04.jpg
│ multipage-05.jpg
│ multipage-06.jpg
│ multipage-07.jpg
│ multipage-08.jpg
│ multipage-09.jpg
│ multipage-10.jpg
│
└───output_pdftoppm
  multipage-01.jpg
  multipage-02.jpg
  multipage-03.jpg
  multipage-04.jpg
  multipage-05.jpg
  multipage-06.jpg
  multipage-07.jpg
  multipage-08.jpg
  multipage-09.jpg
  multipage-10.jpg
Desktop (please complete the following information):
- OS: Windows
- Version: 1.16.0
Workaround
I think the issue is with counter_generator. If we pass a generator for output_file, then counter_generator is never called and we can produce the expected outputs:
import pathlib
from pdf2image import convert_from_path
pdf_file = pathlib.Path(r"./multipage.pdf")
def constant_generator():
    # Yield the same base name forever so no counter is added to it
    while True:
        yield pdf_file.stem
convert_from_path(pdf_file, output_folder=".", output_file=constant_generator(), fmt='jpeg')
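Here is a rough sketch of why the workaround behaves this way. It assumes that convert_from_path only wraps output_file in counter_generator when it receives a string and uses any other iterator as-is, which is what the behavior above suggests. `resolve_output_file` is a hypothetical stand-in for that check, not a real pdf2image function, and the copy of counter_generator here leaves out the @threadsafe decorator so the snippet runs on its own.

```python
def counter_generator(prefix="", suffix="", padding_goal=4):
    """Local copy of the generator under discussion, without @threadsafe."""
    i = 0
    while True:
        i += 1
        yield str(prefix) + str(i).zfill(padding_goal) + str(suffix)


def constant_generator(name):
    """Yield the same base name forever."""
    while True:
        yield name


def resolve_output_file(output_file):
    """Hypothetical stand-in for the assumed check inside convert_from_path."""
    if isinstance(output_file, str):
        # Assumption: strings get wrapped, which is where the 4-digit counter comes from
        return counter_generator(output_file)
    # Anything else is treated as a ready-made name generator
    return output_file


if __name__ == "__main__":
    print(next(resolve_output_file("multipage")))
    print(next(resolve_output_file(constant_generator("multipage"))))
```

Under that assumption, the string case picks up the counter while the generator case passes the stem through unchanged.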
I saw this behavior on a project yesterday - like you, I wasn't expecting that output in the file names. I checked generators.py to look at the counter_generator function. If you look more closely at the output file names, it's not duplicating page numbers - rather, it's appending the number of the thread that handles the page conversion.
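To illustrate, here is a small simulation of how the names would come out if one name is drawn from counter_generator per thread and the pdfto* tool then appends the page number. `simulated_filenames` is a hypothetical helper, and the way pages are split across threads and the zero-padding of the page number are assumptions made for this sketch; they are not taken from the pdf2image source.

```python
def counter_generator(prefix="", suffix="", padding_goal=4):
    """Local copy of the generator under discussion, without @threadsafe."""
    i = 0
    while True:
        i += 1
        yield str(prefix) + str(i).zfill(padding_goal) + str(suffix)


def simulated_filenames(stem, page_count, thread_count=1):
    """Hypothetical model: one counter value per thread, page number appended afterwards."""
    names = counter_generator(stem)
    base, extra = divmod(page_count, thread_count)
    width = len(str(page_count))  # assumed padding, matches the -01 .. -10 outputs above
    result = []
    page = 1
    for thread_index in range(thread_count):
        thread_prefix = next(names)  # drawn once per thread, not once per page
        pages_for_thread = base + (1 if thread_index < extra else 0)
        for _ in range(pages_for_thread):
            result.append(f"{thread_prefix}-{str(page).zfill(width)}.jpg")
            page += 1
    return result


if __name__ == "__main__":
    print(simulated_filenames("multipage", 10))
    print(simulated_filenames("multipage", 10, thread_count=2))
```

With a single thread every file gets the 0001 prefix, which matches the output tree in the report. With two threads, the second half of the pages would get 0002 under this model.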
A simple fix is to change this in generators.py:
@threadsafe
def counter_generator(prefix="", suffix="", padding_goal=4):
    """Returns a joined prefix, iteration number, and suffix"""
    i = 0
    while True:
        i += 1
        yield str(prefix) + str(i).zfill(padding_goal) + str(suffix)
to:
@threadsafe
def counter_generator(prefix="", suffix="", padding_goal=4):
    """Returns a joined prefix, iteration number, and suffix"""
    i = 0
    while True:
        i += 1
        yield str(prefix) + str(suffix)
Looks like there's a PR out waiting on merge to do just that and a bit more.