Text with pypdf rendered incorrectly
Describe the bug
Error details
At page 7 of output.pdf, in the bottom left corner, Acrobat Reader renders an artifact [screenshot] instead of "Version of 2025-09-09", as on the other pages.
The exact artifact and the exact page on which it appears seem to change from run to run.
For instance, in another run, page 7 is fine, but at page 11, at the upper center border, I get an artifact [screenshot] instead of "WORKING PAPER - PLEASE DO NOT DISTRIBUTE".
In another run, at page 19, in the bottom right corner, I get an artifact [screenshot] instead of the page number.
Minimal code
This script takes input.pdf (shared above) and generates output.pdf by:
- Inserting "WORKING PAPER - PLEASE DO NOT DISTRIBUTE" in the upper center border
- The current date in the bottom left corner
- The page number in the bottom right corner
So you need to download input.pdf shared above and put it in the same folder as the Python script.
from datetime import datetime
from contextlib import contextmanager
from fpdf import FPDF, get_scale_factor
from pypdf import PdfWriter, PdfReader
import io
from dataclasses import dataclass
from pathlib import Path
from tqdm import tqdm
import requests, zipfile, io
# Download Aptos font from Microsoft
zip_file_url = "https://download.microsoft.com/download/8/6/0/860a94fa-7feb-44ef-ac79-c072d9113d69/Microsoft%20Aptos%20Fonts.zip"
r = requests.get(zip_file_url)
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall("./aptos")
font_path = Path("./aptos/Aptos.ttf")
assert font_path.is_file()
# Combine fpdf2 with pypdf, see https://py-pdf.github.io/fpdf2/CombineWithPypdf.html#combine-with-pypdf
@contextmanager
def add_to_page(reader_page, unit="mm"):
    k = get_scale_factor(unit)
    format = (reader_page.mediabox[2] / k, reader_page.mediabox[3] / k)
    pdf = FPDF(format=format, unit=unit)
    pdf.add_page()
    yield pdf
    page_overlay = PdfReader(io.BytesIO(pdf.output())).pages[0]
    reader_page.merge_page(page2=page_overlay)

# Text object dataclass
@dataclass
class TextObj:
    text: str
    coords: list

# Add text objects to a page
def add_text_objs(text_objs, writer, pagei):
    for text_obj in text_objs:
        with add_to_page(writer.pages[pagei]) as pdf:
            pdf.add_font(family="my_aptos", fname=font_path)
            pdf.set_font(family="my_aptos", size=9)
            pdf.text(x=text_obj.coords[0], y=text_obj.coords[1], text=text_obj.text)

# Convert coordinates in mm
def pdfbox2mm(box):
    return [float(coord) / get_scale_factor("mm") for coord in box]

reader = PdfReader("input.pdf")
writer = PdfWriter()
pages = list(reader.pages)
for pagei, page in tqdm(enumerate(pages), total=len(pages)):
    writer.add_page(page)
    mediabox_mm = pdfbox2mm(page.mediabox)
    anns = list()
    # Add page numbers
    # In pypdf2, (0,0) is the bottom-left corner.
    # In fpdf2, (0,0) is the upper-left corner.
    coords = [mediabox_mm[2] * 0.9, mediabox_mm[3] * 0.96]
    text = f"{pagei + 1} / {len(reader.pages)}"
    ann = TextObj(text=text, coords=coords)
    anns.append(ann)
    # Add compilation time
    coords = [mediabox_mm[2] * 0.05, mediabox_mm[3] * 0.96]
    text = f"Version of {datetime.now().strftime('%Y-%m-%d')}"
    ann = TextObj(text=text, coords=coords)
    anns.append(ann)
    # Add disclaimer
    coords = [mediabox_mm[2] * 0.35, mediabox_mm[3] * 0.05]
    text = f"WORKING PAPER - PLEASE DO NOT DISTRIBUTE"
    ann = TextObj(text=text, coords=coords)
    anns.append(ann)
    add_text_objs(anns, writer, pagei)
writer.write("output.pdf")
print("Output PDF created")
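The two comments about coordinate origins in the script are easy to get wrong, so here is a minimal sketch of the unit and axis conversions involved. It assumes the usual definitions (1 pt = 1/72 inch, 1 inch = 25.4 mm), which is what dividing by `get_scale_factor("mm")` is understood to apply; `pt_to_mm` and `flip_y` are hypothetical helper names, not part of fpdf2 or pypdf.

```python
# Assumption: PDF user space uses points, with 72 pt per inch and 25.4 mm per inch.
MM_PER_INCH = 25.4
PT_PER_INCH = 72.0


def pt_to_mm(value_pt: float) -> float:
    """Convert a length in PDF points to millimetres."""
    return float(value_pt) * MM_PER_INCH / PT_PER_INCH


def flip_y(y_mm: float, page_height_mm: float) -> float:
    """Convert a y coordinate between a bottom-left origin and a top-left origin.

    The same formula works in both directions.
    """
    return page_height_mm - y_mm


# Example: a footer placed at 96% of the page height, measured from the top,
# expressed as a distance from the bottom edge instead.
page_height_mm = pt_to_mm(842)  # roughly an A4 page height in points
footer_from_top = page_height_mm * 0.96
footer_from_bottom = flip_y(footer_from_top, page_height_mm)
print(round(footer_from_bottom, 2))
```

The script above only needs the top-left convention because all text is placed through fpdf2; the flip matters only if coordinates are taken from pypdf objects such as annotations.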
Environment
- Operating System: Windows
- Python version: 3.13.3
- fpdf2 version used: fpdf2==2.8.4, pypdf==5.8.0
Still happens with git version of fpdf2.
Hi @raffaem,
I wasn't able to reproduce the bug - at least with python 3.12 and pypdf 6.0.0 on Ubuntu.
One thing I noticed though is you are generating 3 different pages in fpdf - each page in your resulting file is the result of merging 4 pages.
Try the small change below. You will see that the output.pdf file is much smaller; please check whether it also solves your bug:
# Add text objects to a page
def add_text_objs(text_objs, writer, pagei):
    # One overlay per page: open the context once, then draw every text object
    with add_to_page(writer.pages[pagei]) as pdf:
        pdf.add_font(family="my_aptos", fname=font_path)
        pdf.set_font(family="my_aptos", size=9)
        for text_obj in text_objs:
            pdf.text(x=text_obj.coords[0], y=text_obj.coords[1], text=text_obj.text)
If the change above doesn't fix the bug you are experiencing, please change the code as below:
@contextmanager
def add_to_page(reader_page, pagei, unit="mm"):
    k = get_scale_factor(unit)
    format = (reader_page.mediabox[2] / k, reader_page.mediabox[3] / k)
    pdf = FPDF(format=format, unit=unit)
    pdf.add_page()
    yield pdf
    fpdf_output = io.BytesIO(pdf.output())
    page_overlay = PdfReader(fpdf_output).pages[0]
    reader_page.merge_page(page2=page_overlay)
    fpdf_output.seek(0)
    with open(f"output_fpdf{pagei}.pdf", "wb") as f:
        f.write(fpdf_output.getvalue())

# Add text objects to a page
def add_text_objs(text_objs, writer, pagei):
    with add_to_page(writer.pages[pagei], pagei) as pdf:
        pdf.add_font(family="my_aptos", fname=font_path)
        pdf.set_font(family="my_aptos", size=9)
        for text_obj in text_objs:
            pdf.text(x=text_obj.coords[0], y=text_obj.coords[1], text=text_obj.text)
This code will output each fpdf page as a separate PDF file; then check whether the bugs are present in the fpdf files.
- If you are seeing the bugs in the individual fpdf files, please report back so we can investigate further.
- If you don't see the bug in the individual fpdf files, only in the final output, the problem is in pypdf's merging function, probably with the font object identifiers of both files during the merge. In this case we need to check whether the problem exists in the latest version of pypdf and open an issue on their repository.
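To compare the font resources of a merged page with those of the standalone overlay, something like the sketch below may help. `font_inventory` is a hypothetical helper, not a pypdf function; it assumes the page behaves like a pypdf `PageObject` (a dictionary whose subscript access resolves indirect objects) and that `/Resources` is stored directly on the page rather than inherited from a parent node.

```python
def font_inventory(page) -> dict:
    """Map each font resource name on a page (e.g. "/F1") to its /BaseFont.

    Assumption: `page` is dictionary-like, as a pypdf PageObject is, and
    carries its own /Resources entry (inherited resources are not followed).
    """
    if "/Resources" not in page:
        return {}
    resources = page["/Resources"]
    if "/Font" not in resources:
        return {}
    fonts = resources["/Font"]
    inventory = {}
    for name in fonts:
        font = fonts[name]
        # Subset fonts usually show up as "/ABCDEF+FontName"
        inventory[str(name)] = str(font["/BaseFont"]) if "/BaseFont" in font else "?"
    return inventory


# Hypothetical usage with pypdf (not executed here):
#   from pypdf import PdfReader
#   print(font_inventory(PdfReader("output.pdf").pages[35]))        # page 36 of the merged file
#   print(font_inventory(PdfReader("output_fpdf36.pdf").pages[0]))  # the standalone overlay
```

If the merged page lists a different subset for the overlay's resource name than the standalone file does, that would point at the merge step rather than at fpdf2.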
Hello @andersonhc,
Thanks very much for the fast answer!
With your small change, the bug still happens: not on the previous sample PDF file, but on this new sample file.
I changed your code a bit to reflect what I'm doing (setting the font at every page), and to have the individual files numbered starting from 1.
Using your debug add_to_page that writes to individual files, the bug doesn't happen on the individual PDF files.
For instance, here are the header [screenshot], the left footer [screenshot], and the right footer [screenshot] of page 36 of output.pdf.
However, the individual PDF file output_fpdf36.pdf looks fine (notice it contains only the text we are adding, but not the text already present):
Moreover, maybe I'm starting to see a pattern: the bug seems to require that the text we add uses the same font as the text already present in the PDF.
At least that would explain why the bug doesn't happen with the initial sample file.
Here are the input sample file, the output file, the individual page output file, and the new code I'm using.
Moreover, I have pypdf 6.0.0 and fpdf 2.8.4.
Do you see the bug with my output.pdf file, in any case?
Are you still unable to reproduce it?
from datetime import datetime
from contextlib import contextmanager
import pypdf
import fpdf
from fpdf import FPDF, get_scale_factor
from pypdf import PdfWriter, PdfReader
import io
from dataclasses import dataclass
from pathlib import Path
from tqdm import tqdm
import requests, zipfile, io
print(pypdf.__version__)
print(fpdf.__version__)
# Download Aptos font from Microsoft
zip_file_url = "https://download.microsoft.com/download/8/6/0/860a94fa-7feb-44ef-ac79-c072d9113d69/Microsoft%20Aptos%20Fonts.zip"
r = requests.get(zip_file_url)
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall("./aptos")
font_path = Path("./aptos/Aptos.ttf")
assert font_path.is_file()
# Combine fpdf2 with pypdf, see https://py-pdf.github.io/fpdf2/CombineWithPypdf.html#combine-with-pypdf
# @contextmanager
# def add_to_page(reader_page, pagei, unit="mm"):
#     k = get_scale_factor(unit)
#     format = (reader_page.mediabox[2] / k, reader_page.mediabox[3] / k)
#     pdf = FPDF(format=format, unit=unit)
#     pdf.add_page()
#     yield pdf
#     page_overlay = PdfReader(io.BytesIO(pdf.output())).pages[0]
#     reader_page.merge_page(page2=page_overlay)

# individual file output (debug purposes)
@contextmanager
def add_to_page(reader_page, pagei, unit="mm"):
    k = get_scale_factor(unit)
    format = (reader_page.mediabox[2] / k, reader_page.mediabox[3] / k)
    pdf = FPDF(format=format, unit=unit)
    pdf.add_page()
    yield pdf
    fpdf_output = io.BytesIO(pdf.output())
    page_overlay = PdfReader(fpdf_output).pages[0]
    reader_page.merge_page(page2=page_overlay)
    fpdf_output.seek(0)
    with open(f"output_fpdf{pagei+1}.pdf", "wb") as f:
        f.write(fpdf_output.getvalue())

# Text object dataclass
@dataclass
class TextObj:
    text: str
    coords: list
    font_family: str = "Aptos"
    font_size: int = 10
    font_path: Path = font_path

# Add text objects to a page
def add_text_objs(text_objs, writer, pagei):
    with add_to_page(writer.pages[pagei], pagei) as pdf:
        for text_obj in text_objs:
            pdf.add_font(family=text_obj.font_family, fname=text_obj.font_path)
            pdf.set_font(family=text_obj.font_family, size=text_obj.font_size)
            pdf.text(x=text_obj.coords[0], y=text_obj.coords[1], text=text_obj.text)

# Convert coordinates in mm
def pdfbox2mm(box):
    return [float(coord) / get_scale_factor("mm") for coord in box]

reader = PdfReader("input.pdf")
writer = PdfWriter()
pages = list(reader.pages)
for pagei, page in tqdm(enumerate(pages), total=len(pages)):
    writer.add_page(page)
    mediabox_mm = pdfbox2mm(page.mediabox)
    anns = list()
    # Add page numbers
    # In pypdf2, (0,0) is the bottom-left corner.
    # In fpdf2, (0,0) is the upper-left corner.
    coords = [mediabox_mm[2] * 0.9, mediabox_mm[3] * 0.96]
    text = f"{pagei + 1} / {len(reader.pages)}"
    ann = TextObj(text=text, coords=coords)
    anns.append(ann)
    # Add compilation time
    coords = [mediabox_mm[2] * 0.05, mediabox_mm[3] * 0.96]
    text = f"Snapshot of {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}"
    ann = TextObj(text=text, coords=coords)
    anns.append(ann)
    # Add disclaimer
    coords = [mediabox_mm[2] * 0.35, mediabox_mm[3] * 0.03]
    text = f"WORKING PAPER - PLEASE DO NOT DISTRIBUTE"
    ann = TextObj(text=text, coords=coords)
    anns.append(ann)
    add_text_objs(anns, writer, pagei)
writer.write("output.pdf")
print("Output PDF created")
I am able to reproduce it now.
It looks like a problem with pypdf's PageObject.merge_page() when merging the fonts.
I will try to investigate further, but I believe you should open an issue on their repository too.
Thank you very much.
I opened an issue on pypdf repository too.
I wouldn't bet money on this being a pypdf bug.
If I take the single page PDFs we generated with your code, and merge them with the input file PDF pages one by one, it works correctly:
from pypdf import PdfReader, PdfWriter

reader_base = PdfReader("input.pdf")
# Merge each single-page overlay into its base page and write the result back
for pagei, page_base in enumerate(reader_base.pages):
    # File names as written by the debug add_to_page above (no underscore before the number)
    reader = PdfReader(f"output_fpdf{pagei+1}.pdf")
    page_box = reader.pages[0]
    page_base.merge_page(page_box)
    writer = PdfWriter()
    writer.add_page(page_base)
    with open(f"output_fpdf_{pagei+1}_merged.pdf", "wb") as fp:
        writer.write(fp)
Please try the snippet below and let me know if it works on your side:
from datetime import datetime
import pypdf
import fpdf
from fpdf import FPDF, get_scale_factor
from pypdf import PdfWriter, PdfReader
import io
from dataclasses import dataclass
from pathlib import Path
from tqdm import tqdm
import requests, zipfile, io
print(pypdf.__version__)
print(fpdf.__version__)
# Download Aptos font from Microsoft
zip_file_url = "https://download.microsoft.com/download/8/6/0/860a94fa-7feb-44ef-ac79-c072d9113d69/Microsoft%20Aptos%20Fonts.zip"
r = requests.get(zip_file_url)
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall("./aptos")
font_path = Path("./aptos/Aptos.ttf")
assert font_path.is_file()
def create_overlay(width, height, page_count, unit="mm") -> io.BytesIO:
    pdf = FPDF(format=(width, height), unit=unit)
    pdf.add_font(family="Aptos", fname=font_path)
    pdf.set_font(family="Aptos", size=10)
    for i in range(page_count):
        pdf.add_page()
        pdf.text(x=width * 0.9, y=height * 0.96, text=f"{i + 1} / {page_count}")
        pdf.text(x=width * 0.05, y=height * 0.96, text=f"Snapshot of {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
        pdf.text(x=width * 0.35, y=height * 0.03, text=f"WORKING PAPER - PLEASE DO NOT DISTRIBUTE")
    return io.BytesIO(pdf.output())

# Convert coordinates in mm
def pdfbox2mm(box):
    return [float(coord) / get_scale_factor("mm") for coord in box]

reader = PdfReader("input.pdf")
writer = PdfWriter()
pages = list(reader.pages)
mediabox_mm = pdfbox2mm(pages[0].mediabox)
overlay = create_overlay(mediabox_mm[2], mediabox_mm[3], len(pages))
overlay_reader = PdfReader(overlay)
for i in range(len(pages)):
    pages[i].merge_page(page2=overlay_reader.pages[i])
    writer.add_page(pages[i])
writer.write("output.pdf")
print("Output PDF created")
This version should help in two ways:
- It avoids the possible garbage-collection pitfall we suspect (I’ll keep digging to confirm the root cause).
- It produces a much smaller PDF and reduces the chance of font-object bloat, explained below.
When a PDF is created with an embedded font, we don’t embed the entire font. We subset it to only the glyphs actually used, remap the code points, and write that subset. This keeps files smaller and rendering faster.
In your previous approach, each overlay was rendered into its own PDF, so when pypdf merged them, it ended up with one font subset per input PDF, even if those subsets contained the same glyphs. That’s why your output had hundreds of font objects and the output PDF was so big.
With the new code, one single PDF is built that contains all overlay pages, so the same embedded font subset is reused across overlays. Because pypdf sees the same font object, it doesn’t duplicate it, and the resulting file is much smaller.
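As a rough illustration of the subsetting point, here is a toy model in plain Python. It is not fpdf2's actual algorithm: `build_subset`, `encode` and `decode` are hypothetical names, and assigning codes in order of first use is an assumption made only for the sketch. The point is that the codes written into a page only mean something relative to the subset that produced them, so independently built subsets are not interchangeable, while one shared subset serves every overlay.

```python
def build_subset(text: str) -> dict:
    """Toy subsetting: give each distinct character a compact code, in order of first use."""
    mapping = {}
    for ch in text:
        if ch not in mapping:
            mapping[ch] = len(mapping) + 1  # code 0 is left unused in this toy model
    return mapping


def encode(text: str, subset: dict) -> list:
    """Translate text into the codes of a given subset."""
    return [subset[ch] for ch in text]


def decode(codes: list, subset: dict) -> str:
    """Translate codes back to characters; unknown codes become '?'."""
    reverse = {code: ch for ch, code in subset.items()}
    return "".join(reverse.get(code, "?") for code in codes)


footer = "Version of 2025-09-09"
header = "WORKING PAPER - PLEASE DO NOT DISTRIBUTE"

# One subset per overlay PDF, as in the original per-page approach
footer_subset = build_subset(footer)
header_subset = build_subset(header)
footer_codes = encode(footer, footer_subset)

print(decode(footer_codes, footer_subset))  # decoded with its own subset
print(decode(footer_codes, header_subset))  # decoded with an unrelated subset

# One shared subset for all overlays, as in the single-PDF approach
shared_subset = build_subset(header + footer)
print(decode(encode(footer, shared_subset), shared_subset))
```

Decoding with an unrelated subset produces unrelated characters. Whether this is what actually happened in the garbled pages is not confirmed in this thread; the sketch only shows why subsets cannot be mixed up safely.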
If anything still looks off please let me know.
Hello,
nothing looks off running this on 3.13.
Thank you very much for the help and the explanations.
The explanations are really useful ... I just copy-pasted the example in the guide without thinking too much about what would happen underneath.
In reality, Aptos should be the font already used by the input PDF, so I think the next step would be to reuse that, but that's probably asking too much :) It's not a problem for me anyway: the original code did write a huge PDF, but I used to compress it with Ghostscript. I was speaking from a pure optimization perspective.
Hello,
A slightly modified version of your code, to handle pages with different orientation, doesn't work with this input file.
The second page in the output file is empty.
(Notice this is an entirely new input file).
from datetime import datetime
import pypdf
import fpdf
from fpdf import FPDF, get_scale_factor
from pypdf import PdfWriter, PdfReader
import io
from dataclasses import dataclass
from pathlib import Path
from tqdm import tqdm
import requests, zipfile, io
print(pypdf.__version__)
print(fpdf.__version__)
# Download Aptos font from Microsoft
zip_file_url = "https://download.microsoft.com/download/8/6/0/860a94fa-7feb-44ef-ac79-c072d9113d69/Microsoft%20Aptos%20Fonts.zip"
r = requests.get(zip_file_url)
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall("./aptos")
font_file = Path("./aptos/Aptos.ttf")
assert font_file.is_file()
font_family = "my_aptos"
font_size = 9
def create_overlay(mediaboxes_mm, page_count, unit="mm") -> io.BytesIO:
    assert len(mediaboxes_mm) == page_count
    pdf = FPDF(format=(mediaboxes_mm[0][2], mediaboxes_mm[0][3]), unit=unit)
    pdf.add_font(family=font_family, fname=font_file)
    pdf.set_font(family=font_family, size=font_size)
    for i in range(page_count):
        width = mediaboxes_mm[i][2]
        height = mediaboxes_mm[i][3]
        pdf.add_page(format=(width, height))
        # In FPDF2, (0,0) is the top-left corner
        pdf.text(x=width * 0.9, y=height * 0.96, text=f"{i + 1} / {page_count}")
        pdf.text(x=width * 0.05, y=height * 0.96, text=f"Snapshot of {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
        pdf.text(x=width * 0.35, y=height * 0.03, text=f"WORKING PAPER - PLEASE DO NOT DISTRIBUTE")
    # <DEBUG>
    # pdf.output("debug.pdf")
    # </DEBUG>
    return io.BytesIO(pdf.output())

# Convert coordinates in mm
def pdfbox2mm(box):
    return [float(coord) / get_scale_factor("mm") for coord in box]

input_reader = PdfReader("input.pdf")
output_writer = PdfWriter()
input_pages = list(input_reader.pages)
mediaboxes_mm = [pdfbox2mm(page.mediabox) for page in input_pages]
overlay = create_overlay(mediaboxes_mm, len(input_pages))
overlay_reader = PdfReader(overlay)
for i in range(len(input_pages)):
    input_pages[i].merge_page(page2=overlay_reader.pages[i])
    output_writer.add_page(input_pages[i])
output_writer.write("output.pdf")
I am at a loss here...
I wrote the content of overlay produced by fpdf2 to a file and it looks fine:
input-2.pdf output-2-fpdf-overlay.pdf
and I tried this minimal code with pypdf only merging the files, and the second page is still completely blank:
from pypdf import PdfReader, PdfWriter
input_reader = PdfReader("input-2.pdf")
overlay_reader = PdfReader("output-2-fpdf-overlay.pdf")
output_writer = PdfWriter()
input_pages = list(input_reader.pages)
for i in range(len(input_pages)):
    input_pages[i].merge_page(page2=overlay_reader.pages[i])
    output_writer.add_page(input_pages[i])
output_writer.write("output-merged.pdf")
I guess we'll need @stefan6419846 's help.
I remember that we had a similar issue in our issue tracker some time ago with the similar side effect that the first page would have some strange influence on the second one - and when skipping the first page, everything would be correct. I did not yet have the time to look into possible differences from the merge process in both cases for your example, but the following alternative code (which effectively does the same as yours) generates correct results for me:
from pypdf import PdfReader, PdfWriter
input_reader = PdfReader("input-2.pdf")
overlay_reader = PdfReader("output-2-fpdf-overlay.pdf")
output_writer = PdfWriter(clone_from=input_reader)
for i, input_page in enumerate(output_writer.pages):
    overlay_page = overlay_reader.pages[i]
    input_page.merge_page(page2=overlay_page)
output_writer.write("output-merged.pdf")
Please note that if you additionally might have to deal with rotated pages, you should add corresponding handling according to the docs.
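The clone-then-merge pattern above can be wrapped in a small helper that also guards against a page-count mismatch between the input and the overlay. This is only a sketch: `merge_overlays` is a hypothetical name, and it relies on nothing more than each base page exposing `merge_page(page2=...)`, as used in the snippets in this thread.

```python
def merge_overlays(base_pages, overlay_pages) -> int:
    """Merge overlay page i onto base page i, in place, and return the page count.

    Both arguments are assumed to support len() and iteration, as the
    `.pages` collections of pypdf readers and writers do.
    """
    if len(base_pages) != len(overlay_pages):
        raise ValueError(
            f"page count mismatch: {len(base_pages)} base pages, "
            f"{len(overlay_pages)} overlay pages"
        )
    for base_page, overlay_page in zip(base_pages, overlay_pages):
        base_page.merge_page(page2=overlay_page)
    return len(base_pages)


# Hypothetical usage with pypdf (not executed here):
#   from pypdf import PdfReader, PdfWriter
#   writer = PdfWriter(clone_from=PdfReader("input-2.pdf"))
#   merge_overlays(writer.pages, PdfReader("output-2-fpdf-overlay.pdf").pages)
#   writer.write("output-merged.pdf")
```

Passing `writer.pages` rather than `reader.pages` is the relevant design choice: the pages being modified then already belong to the writer, which is what distinguishes the working snippet from the failing one.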
> I remember that we had a similar issue in our issue tracker some time ago with the similar side effect that the first page would have some strange influence on the second one - and when skipping the first page, everything would be correct.
Yes, that is our case.
Is it a bug? Is my code supposed to work? Should I open an issue on pypdf?
> the following alternative code (which effectively does the same as yours) generates correct results for me
Thank you! Finally I have something that works in production (I tried it with the full PDF that the input PDF was part of).
> Please note that if you additionally might have to deal with rotated pages, you should add corresponding handling according to the docs.
It worked on landscape pages without that additional part.
Rotated pages means pages with page.rotation != 0. The rotation can be part of the page content stream itself, which does not have this issue. If everything works correctly for you, you most likely have page.rotation == 0 everywhere.
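For files that do contain pages with a non-zero rotation, a minimal sketch of the extra handling could look like this. It assumes pypdf's `PageObject.rotation` property and `PageObject.transfer_rotation_to_content()` method as described in the pypdf documentation; `normalize_rotation` itself is a hypothetical helper name.

```python
def normalize_rotation(pages) -> list:
    """Bake each page's rotation into its content before merging overlays.

    Returns the indices of the pages that were rotated. Pages are assumed
    to expose `rotation` and `transfer_rotation_to_content()`.
    """
    changed = []
    for index, page in enumerate(pages):
        if page.rotation != 0:
            page.transfer_rotation_to_content()
            changed.append(index)
    return changed


# Hypothetical usage with pypdf (not executed here), before merging the overlay:
#   writer = PdfWriter(clone_from=PdfReader("input-2.pdf"))
#   normalize_rotation(writer.pages)
```

Note that the overlay dimensions should be computed after this step, since a 90 degree rotation may swap the width and height of the media box.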
> Is it a bug? Is my code supposed to work? Should I open an issue on pypdf?
I will try to investigate this before opening an issue with proper details, if I find some time to do so.
Upstream issue: https://github.com/py-pdf/pypdf/issues/2260
The data is there, but for some reason we have an unencoded content stream that still specifies a filter, so it fails to render (and to be read by pypdf).
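To illustrate why such a mismatch breaks reading, here is a toy decoder using only the standard library. It is not pypdf's code, and the choice of `/FlateDecode` is an assumption made for the example (the thread does not say which filter is declared); `decode_stream` is a hypothetical name.

```python
import zlib
from typing import Optional


def decode_stream(data: bytes, declared_filter: Optional[str]) -> bytes:
    """Toy stream decoder: trust the declared filter, as a reader normally would."""
    if declared_filter == "/FlateDecode":
        return zlib.decompress(data)  # raises zlib.error if data is not zlib data
    return data


content = b"BT /F1 9 Tf (WORKING PAPER) Tj ET"

# Consistent stream: compressed data with the filter declared
roundtrip = decode_stream(zlib.compress(content), "/FlateDecode")

# Inconsistent stream: raw data, but a filter is still declared
try:
    decode_stream(content, "/FlateDecode")
    mismatch_detected = False
except zlib.error:
    mismatch_detected = True

print(roundtrip == content, mismatch_detected)
```

The content bytes are intact in the second case; only the declared filter makes them unreadable, which matches the observation that the data is present but the page renders blank.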
assign this to me
@BharathPESU No need to assign anyone here - nevertheless, you are of course invited to further investigate the corresponding issue and look for a suitable solution.