Text with pypdf rendered incorrectly

Open raffaem opened this issue 3 months ago • 16 comments

Describe the bug

Error details

input.pdf

output.pdf

At page 7 of output.pdf, in the bottom left corner, Acrobat Reader renders

Image

Instead of rendering "Version of 2025-09-09", like in the other pages.

The exact artifact and the page on which it appears seem to change from run to run.

For instance, in another run, page 7 is fine, but at page 11, at the upper center border, I get

Image

Instead of "WORKING PAPER - PLEASE DO NOT DISTRIBUTE".

In another run at page 19, in the bottom right corner, I get

Image

instead of the page number.

Minimal code

This script takes input.pdf (shared above) and generates output.pdf by:

  1. Inserting "WORKING PAPER - PLEASE DO NOT DISTRIBUTE" in the upper center border
  2. The current date in the bottom left corner
  3. The page number in the bottom right corner

So you need to download input.pdf shared above and put it in the same folder as the Python script.

from datetime import datetime
from contextlib import contextmanager
from fpdf import FPDF, get_scale_factor
from pypdf import PdfWriter, PdfReader
import io
from dataclasses import dataclass
from pathlib import Path
from tqdm import tqdm
import requests, zipfile, io

# Download Aptos font from Microsoft
zip_file_url = "https://download.microsoft.com/download/8/6/0/860a94fa-7feb-44ef-ac79-c072d9113d69/Microsoft%20Aptos%20Fonts.zip"
r = requests.get(zip_file_url)
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall("./aptos")
font_path = Path("./aptos/Aptos.ttf")
assert font_path.is_file()

# Combine fpdf2 with pypdf, see https://py-pdf.github.io/fpdf2/CombineWithPypdf.html#combine-with-pypdf
@contextmanager
def add_to_page(reader_page, unit="mm"):
    k = get_scale_factor(unit)
    format = (reader_page.mediabox[2] / k, reader_page.mediabox[3] / k)
    pdf = FPDF(format=format, unit=unit)
    pdf.add_page()
    yield pdf
    page_overlay = PdfReader(io.BytesIO(pdf.output())).pages[0]
    reader_page.merge_page(page2=page_overlay)

# Text object dataclass
@dataclass
class TextObj:
    text: str
    coords: list

# Add text objects to a page
def add_text_objs(text_objs, writer, pagei):
    for text_obj in text_objs:
        with add_to_page(writer.pages[pagei]) as pdf:
            pdf.add_font(family="my_aptos", fname=font_path)
            pdf.set_font(family="my_aptos", size=9)
            pdf.text(x=text_obj.coords[0], y=text_obj.coords[1], text=text_obj.text)

# Convert coordinates in mm
def pdfbox2mm(box):
    return [float(coord) / get_scale_factor("mm") for coord in box]

reader = PdfReader("input.pdf")
writer = PdfWriter()
pages = list(reader.pages)

for pagei, page in tqdm(enumerate(pages), total=len(pages)):

    writer.add_page(page)
    mediabox_mm = pdfbox2mm(page.mediabox)
    anns = list()

    # Add page numbers
    # In pypdf2, (0,0) is the bottom-left corner.
    # In fpdf2, (0,0) is the upper-left corner.
    coords = [mediabox_mm[2] * 0.9, mediabox_mm[3] * 0.96]
    text = f"{pagei + 1} / {len(reader.pages)}"
    ann = TextObj(text=text, coords=coords)
    anns.append(ann)

    # Add compilation time
    coords = [mediabox_mm[2] * 0.05, mediabox_mm[3] * 0.96]
    text = f"Version of {datetime.now().strftime('%Y-%m-%d')}"
    ann = TextObj(text=text, coords=coords)
    anns.append(ann)

    # Add disclaimer
    coords = [mediabox_mm[2] * 0.35, mediabox_mm[3] * 0.05]
    text = f"WORKING PAPER - PLEASE DO NOT DISTRIBUTE"
    ann = TextObj(text=text, coords=coords)
    anns.append(ann)

    add_text_objs(anns, writer, pagei)

writer.write("output.pdf")
print("Output PDF created")
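
The placement fractions in the script rely on the two conventions noted in its comments: pypdf reports the mediabox in PDF points with the origin at the bottom-left, while fpdf2 measures from the top-left in the chosen unit. Below is a minimal sketch of that conversion, assuming 72 points per inch, an A4 mediabox whose lower-left corner is at (0, 0), and hypothetical helper names (`points_to_mm`, `flip_y`) that are not part of either library:

```python
POINTS_PER_INCH = 72.0
MM_PER_INCH = 25.4


def points_to_mm(value_pt: float) -> float:
    """Convert a length in PDF points to millimetres."""
    return value_pt * MM_PER_INCH / POINTS_PER_INCH


def flip_y(y_mm: float, page_height_mm: float) -> float:
    """Switch a y coordinate between top-edge and bottom-edge measurement."""
    return page_height_mm - y_mm


# Assumed A4 mediabox in points: [llx, lly, urx, ury]
mediabox_pt = [0.0, 0.0, 595.28, 841.89]
width_mm = points_to_mm(mediabox_pt[2])
height_mm = points_to_mm(mediabox_pt[3])

# Same fraction as the script: footer at 96% of the page height, measured from the top
footer_y_mm = height_mm * 0.96
# The same spot expressed as a distance from the bottom edge
footer_from_bottom_mm = flip_y(footer_y_mm, height_mm)

print(round(width_mm, 1), round(height_mm, 1))
print(round(footer_y_mm, 2), round(footer_from_bottom_mm, 2))
```

Using `mediabox[2]` and `mediabox[3]` directly as width and height, as the script does, only holds when the lower-left corner of the mediabox is at the origin.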

Environment

  • Operating System: Windows
  • Python version: 3.13.3
  • fpdf2 version used: fpdf2==2.8.4, pypdf==5.8.0

Still happens with git version of fpdf2.

raffaem avatar Sep 09 '25 00:09 raffaem

Hi @raffaem,

I wasn't able to reproduce the bug - at least with python 3.12 and pypdf 6.0.0 on Ubuntu.

One thing I noticed though is you are generating 3 different pages in fpdf - each page in your resulting file is the result of merging 4 pages.

Try the small change below - you will see the output.pdf file will be much smaller - and check if it also solves your bug:

# Add text objects to a page
def add_text_objs(text_objs, writer, pagei):
    with add_to_page(writer.pages[pagei]) as pdf:
        pdf.add_font(family="my_aptos", fname=font_path)
        pdf.set_font(family="my_aptos", size=9)
        for text_obj in text_objs:
            pdf.text(x=text_obj.coords[0], y=text_obj.coords[1], text=text_obj.text)

If the change above doesn't fix the bug you are experiencing, please change the code as below:

@contextmanager
def add_to_page(reader_page, pagei, unit="mm"):
    k = get_scale_factor(unit)
    format = (reader_page.mediabox[2] / k, reader_page.mediabox[3] / k)
    pdf = FPDF(format=format, unit=unit)
    pdf.add_page()
    yield pdf
    fpdf_output = io.BytesIO(pdf.output())
    page_overlay = PdfReader(fpdf_output).pages[0]
    reader_page.merge_page(page2=page_overlay)
    fpdf_output.seek(0)
    with open(f"output_fpdf{pagei}.pdf", "wb") as f:
        f.write(fpdf_output.getvalue())

# Add text objects to a page
def add_text_objs(text_objs, writer, pagei):
    with add_to_page(writer.pages[pagei], pagei) as pdf:
        pdf.add_font(family="my_aptos", fname=font_path)
        pdf.set_font(family="my_aptos", size=9)
        for text_obj in text_objs:
            pdf.text(x=text_obj.coords[0], y=text_obj.coords[1], text=text_obj.text)

This code writes each fpdf2 overlay page to a separate PDF file. Then check whether the bug is present in those individual files.

  • If you are seeing the bugs on the individual fpdf files, please report back so we can investigate further
  • If you don't see the bug in the individual fpdf files, only in the final output, the problem is in pypdf's merging function, probably related to the font object identifiers of the two files during merging. In that case we need to check whether the problem exists in the latest version of pypdf and open an issue on their repository.

andersonhc avatar Sep 09 '25 03:09 andersonhc

Hello @andersonhc,

Thanks very much for the fast answer!

With your small change, the bug still happens, though not with the previous sample PDF file, only with this new sample file.

I changed your code a bit to reflect what I'm doing (setting the font for every text object), and to have the individual files numbered starting from 1.

Using your debug add_to_page that writes to individual files, the bug doesn't happen in the individual PDF files.

For instance, here is the header

Image

the left footer

Image

and the right footer

Image

of page 36 of output.pdf.

However, the individual PDF file output_fpdf36.pdf looks fine (notice it contains only the text we are adding, but not the text already present):

Image

Image

Image

Moreover, maybe I'm starting to see a pattern: the bug appears when the text we add uses the same font as the text already present in the PDF.

At least that would explain why the bug doesn't happen with the initial sample file.

Here are the input sample file, the output file, the individual page output file, and the new code I'm using.

Moreover, I have pypdf 6.0.0 and fpdf 2.8.4.

Do you see the bug with my output.pdf file, in any case?

Are you still unable to reproduce it?

input.pdf

input.docx

output.pdf

output_fpdf36.pdf

from datetime import datetime
from contextlib import contextmanager
import pypdf
import fpdf
from fpdf import FPDF, get_scale_factor
from pypdf import PdfWriter, PdfReader
import io
from dataclasses import dataclass
from pathlib import Path
from tqdm import tqdm
import requests, zipfile, io

print(pypdf.__version__)
print(fpdf.__version__)  

# Download Aptos font from Microsoft
zip_file_url = "https://download.microsoft.com/download/8/6/0/860a94fa-7feb-44ef-ac79-c072d9113d69/Microsoft%20Aptos%20Fonts.zip"
r = requests.get(zip_file_url)
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall("./aptos")
font_path = Path("./aptos/Aptos.ttf")
assert font_path.is_file()

# Combine fpdf2 with pypdf, see https://py-pdf.github.io/fpdf2/CombineWithPypdf.html#combine-with-pypdf
# @contextmanager
# def add_to_page(reader_page, pagei, unit="mm"):
#     k = get_scale_factor(unit)
#     format = (reader_page.mediabox[2] / k, reader_page.mediabox[3] / k)
#     pdf = FPDF(format=format, unit=unit)
#     pdf.add_page()
#     yield pdf
#     page_overlay = PdfReader(io.BytesIO(pdf.output())).pages[0]
#     reader_page.merge_page(page2=page_overlay)

# individual file output (debug purposes)
@contextmanager
def add_to_page(reader_page, pagei, unit="mm"):
    k = get_scale_factor(unit)
    format = (reader_page.mediabox[2] / k, reader_page.mediabox[3] / k)
    pdf = FPDF(format=format, unit=unit)
    pdf.add_page()
    yield pdf
    fpdf_output = io.BytesIO(pdf.output())
    page_overlay = PdfReader(fpdf_output).pages[0]
    reader_page.merge_page(page2=page_overlay)
    fpdf_output.seek(0)
    with open(f"output_fpdf{pagei+1}.pdf", "wb") as f:
        f.write(fpdf_output.getvalue())

# Text object dataclass
@dataclass
class TextObj:
    text: str
    coords: list
    font_family: str = "Aptos"
    font_size: int = 10
    font_path: Path = font_path

# Add text objects to a page
def add_text_objs(text_objs, writer, pagei):
    with add_to_page(writer.pages[pagei], pagei) as pdf:
        for text_obj in text_objs:
            pdf.add_font(family=text_obj.font_family, fname=text_obj.font_path)
            pdf.set_font(family=text_obj.font_family, size=text_obj.font_size)
            pdf.text(x=text_obj.coords[0], y=text_obj.coords[1], text=text_obj.text)

# Convert coordinates in mm
def pdfbox2mm(box):
    return [float(coord) / get_scale_factor("mm") for coord in box]

reader = PdfReader("input.pdf")
writer = PdfWriter()
pages = list(reader.pages)

for pagei, page in tqdm(enumerate(pages), total=len(pages)):

    writer.add_page(page)
    mediabox_mm = pdfbox2mm(page.mediabox)
    anns = list()

    # Add page numbers
    # In pypdf2, (0,0) is the bottom-left corner.
    # In fpdf2, (0,0) is the upper-left corner.
    coords = [mediabox_mm[2] * 0.9, mediabox_mm[3] * 0.96]
    text = f"{pagei + 1} / {len(reader.pages)}"
    ann = TextObj(text=text, coords=coords)
    anns.append(ann)

    # Add compilation time
    coords = [mediabox_mm[2] * 0.05, mediabox_mm[3] * 0.96]
    text = f"Snapshot of {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}"
    ann = TextObj(text=text, coords=coords)
    anns.append(ann)

    # Add disclaimer
    coords = [mediabox_mm[2] * 0.35, mediabox_mm[3] * 0.03]
    text = f"WORKING PAPER - PLEASE DO NOT DISTRIBUTE"
    ann = TextObj(text=text, coords=coords)
    anns.append(ann)

    add_text_objs(anns, writer, pagei)

writer.write("output.pdf")
print("Output PDF created")

raffaem avatar Sep 09 '25 06:09 raffaem

I am able to reproduce it now. It looks like a problem with pypdf's PageObject.merge_page() when merging the fonts. I will try to investigate further, but I believe you should open an issue on their repository too.

andersonhc avatar Sep 10 '25 10:09 andersonhc

I am able to reproduce it now. It looks like a problem with pypdf's PageObject.merge_page() when merging the fonts. I will try to investigate further, but I believe you should open an issue on their repository too.

Thank you very much.

I opened an issue on pypdf repository too.

raffaem avatar Sep 10 '25 12:09 raffaem

I wouldn't bet money on this being a pypdf bug.

If I take the single page PDFs we generated with your code, and merge them with the input file PDF pages one by one, it works correctly:

from pypdf import PdfReader, PdfWriter

reader_base = PdfReader("input.pdf")

# Merge each single-page overlay into the matching input page and write the result
for pagei, page_base in enumerate(reader_base.pages):

    # Overlay file written by the debug version of add_to_page
    reader = PdfReader(f"output_fpdf{pagei+1}.pdf")
    page_box = reader.pages[0]

    page_base.merge_page(page_box)
    writer = PdfWriter()
    writer.add_page(page_base)
    with open(f"output_fpdf{pagei+1}_merged.pdf", "wb") as fp:
        writer.write(fp)

raffaem avatar Sep 10 '25 13:09 raffaem

Please try the snippet below and let me know if it works on your side:

from datetime import datetime
import pypdf
import fpdf
from fpdf import FPDF, get_scale_factor
from pypdf import PdfWriter, PdfReader
import io
from dataclasses import dataclass
from pathlib import Path
from tqdm import tqdm
import requests, zipfile, io

print(pypdf.__version__)
print(fpdf.__version__)  

# Download Aptos font from Microsoft
zip_file_url = "https://download.microsoft.com/download/8/6/0/860a94fa-7feb-44ef-ac79-c072d9113d69/Microsoft%20Aptos%20Fonts.zip"
r = requests.get(zip_file_url)
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall("./aptos")
font_path = Path("./aptos/Aptos.ttf")
assert font_path.is_file()

def create_overlay(width, height, page_count, unit="mm") -> io.BytesIO:
    pdf = FPDF(format=(width, height), unit=unit)
    pdf.add_font(family="Aptos", fname=font_path)
    pdf.set_font(family="Aptos", size=10)
            
    for i in range(page_count):
        pdf.add_page()
      
        pdf.text(x=width * 0.9, y=height * 0.96, text=f"{i + 1} / {page_count}")
        pdf.text(x=width * 0.05, y=height * 0.96, text=f"Snapshot of {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
        pdf.text(x=width * 0.35, y=height * 0.03, text=f"WORKING PAPER - PLEASE DO NOT DISTRIBUTE")
    
    return io.BytesIO(pdf.output())
    
# Convert coordinates in mm
def pdfbox2mm(box):
    return [float(coord) / get_scale_factor("mm") for coord in box]

reader = PdfReader("input.pdf")
writer = PdfWriter()
pages = list(reader.pages)
mediabox_mm = pdfbox2mm(pages[0].mediabox)
overlay = create_overlay(mediabox_mm[2], mediabox_mm[3], len(pages))
overlay_reader = PdfReader(overlay)

for i in range(len(pages)):
    pages[i].merge_page(page2=overlay_reader.pages[i])
    writer.add_page(pages[i])

writer.write("output.pdf")
print("Output PDF created")

This version should help in two ways:

  • It avoids the possible garbage-collection pitfall we suspect (I’ll keep digging to confirm the root cause).
  • It produces a much smaller PDF and reduces the chance of font-object bloat, explained below.

When a PDF is created with an embedded font, we don’t embed the entire font. We subset it to only the glyphs actually used, remap the code points, and write that subset. This keeps files smaller and rendering faster.

In your previous approach, each overlay was rendered into its own PDF, so when pypdf merged them, it ended up with one font subset per input PDF, even if those subsets contained the same glyphs. That’s why your output had hundreds of font objects and the output PDF was so big.

With the new code, one single PDF is built that contains all overlay pages, so the same embedded font subset is reused across overlays. Because pypdf sees the same font object, it doesn’t duplicate it, and the resulting file is much smaller.
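
The subsetting behaviour described above can be pictured with a small toy model. This is only a sketch under stated assumptions, not fpdf2's actual implementation: `ToyDocument` and `decode` are hypothetical names, and each toy document simply hands out sequential codes to the characters it uses. It shows why separately generated overlays each carry their own subset, and why codes written against one subset are meaningless when read through another. Whether the garbled text in this thread comes from such a mix-up is not confirmed here.

```python
from itertools import count


class ToyDocument:
    """Toy stand-in for one generated PDF that owns a single font subset."""

    def __init__(self):
        self.glyph_codes = {}          # character -> code assigned in this subset
        self._next_code = count(1)     # codes are handed out in order of first use

    def encode(self, text):
        """Return the codes for text, adding unseen characters to the subset."""
        codes = []
        for ch in text:
            if ch not in self.glyph_codes:
                self.glyph_codes[ch] = next(self._next_code)
            codes.append(self.glyph_codes[ch])
        return codes


def decode(codes, glyph_codes):
    """Map codes back to characters using the given subset ('?' if unknown)."""
    reverse = {code: ch for ch, code in glyph_codes.items()}
    return "".join(reverse.get(code, "?") for code in codes)


texts = ["1 / 3", "Version of 2025-09-09", "WORKING PAPER"]

# One document per text, as in the original approach: three independent subsets
separate = [ToyDocument() for _ in texts]
encoded = [doc.encode(text) for doc, text in zip(separate, texts)]
subset_count_separate = len(separate)

# One document for all texts, as in the new approach: a single shared subset
shared = ToyDocument()
shared_encoded = [shared.encode(text) for text in texts]

# Codes only make sense together with the subset that produced them
right = decode(encoded[1], separate[1].glyph_codes)
wrong = decode(encoded[1], separate[2].glyph_codes)

print(subset_count_separate, len(shared.glyph_codes))
print(right)
print(wrong)
```

In the toy model the shared document ends up with one subset covering every character used, while the per-text documents produce one subset each, which mirrors the difference in font-object count described above.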

If anything still looks off please let me know.

andersonhc avatar Sep 11 '25 01:09 andersonhc

Hello,

nothing looks off running this on 3.13.

Thank you very much for the help and the explanations.

The explanations are really useful ... I just copy-pasted the example in the guide without thinking too much about what happens underneath.

In reality, Aptos should be the font used by the input PDF, so I think the next step would be to reuse that ... but that's probably asking too much :) (It's not a problem for me anyway: the original code did write a huge PDF, but I compressed it with Ghostscript. I was speaking from a purely optimization perspective.)

raffaem avatar Sep 11 '25 03:09 raffaem

Hello,

A slightly modified version of your code, to handle pages with different orientation, doesn't work with this input file.

The second page in the output file is empty.

(Notice this is an entirely new input file).

input.pdf

output.pdf

from datetime import datetime
import pypdf
import fpdf
from fpdf import FPDF, get_scale_factor
from pypdf import PdfWriter, PdfReader
import io
from dataclasses import dataclass
from pathlib import Path
from tqdm import tqdm
import requests, zipfile, io

print(pypdf.__version__)
print(fpdf.__version__)  

# Download Aptos font from Microsoft
zip_file_url = "https://download.microsoft.com/download/8/6/0/860a94fa-7feb-44ef-ac79-c072d9113d69/Microsoft%20Aptos%20Fonts.zip"
r = requests.get(zip_file_url)
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall("./aptos")
font_file = Path("./aptos/Aptos.ttf")
assert font_file.is_file()

font_family = "my_aptos"
font_size = 9

def create_overlay(mediaboxes_mm, page_count, unit="mm") -> io.BytesIO:

    assert len(mediaboxes_mm) == page_count

    pdf = FPDF(format=(mediaboxes_mm[0][2], mediaboxes_mm[0][3]), unit=unit)
    pdf.add_font(family=font_family, fname=font_file)
    pdf.set_font(family=font_family, size=font_size)
            
    for i in range(page_count):
        width = mediaboxes_mm[i][2]
        height = mediaboxes_mm[i][3]

        pdf.add_page(format=(width, height))
      
        # In FPDF2, (0,0) is the top-left corner
        pdf.text(x=width * 0.9, y=height * 0.96, text=f"{i + 1} / {page_count}")
        pdf.text(x=width * 0.05, y=height * 0.96, text=f"Snapshot of {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
        pdf.text(x=width * 0.35, y=height * 0.03, text=f"WORKING PAPER - PLEASE DO NOT DISTRIBUTE")
    
    # <DEBUG>
    # pdf.output("debug.pdf")
    # </DEBUG>
    return io.BytesIO(pdf.output())
    
# Convert coordinates in mm
def pdfbox2mm(box):
    return [float(coord) / get_scale_factor("mm") for coord in box]

input_reader = PdfReader("input.pdf")
output_writer = PdfWriter()
input_pages = list(input_reader.pages)
mediaboxes_mm = [pdfbox2mm(page.mediabox) for page in input_pages]
overlay = create_overlay(mediaboxes_mm, len(input_pages))
overlay_reader = PdfReader(overlay)

for i in range(len(input_pages)):
    input_pages[i].merge_page(page2=overlay_reader.pages[i])
    output_writer.add_page(input_pages[i])

output_writer.write("output.pdf")

raffaem avatar Sep 11 '25 04:09 raffaem

I am at a loss here... I wrote the content of overlay produced by fpdf2 to a file and it looks fine:

input-2.pdf output-2-fpdf-overlay.pdf

and I tried this minimal code with pypdf only merging the files, and the second page is still completely blank:

from pypdf import PdfReader, PdfWriter

input_reader = PdfReader("input-2.pdf")
overlay_reader = PdfReader("output-2-fpdf-overlay.pdf")
output_writer = PdfWriter()
input_pages = list(input_reader.pages)

for i in range(len(input_pages)):
    input_pages[i].merge_page(page2=overlay_reader.pages[i])
    output_writer.add_page(input_pages[i])

output_writer.write("output-merged.pdf")

I guess we'll need @stefan6419846 's help.

andersonhc avatar Sep 11 '25 07:09 andersonhc

I remember that we had a similar issue in our issue tracker some time ago, with a similar side effect: the first page had some strange influence on the second one, and when skipping the first page, everything was correct. I have not yet had the time to look into possible differences in the merge process between the two cases for your example, but the following alternative code (which effectively does the same as yours) generates correct results for me:

from pypdf import PdfReader, PdfWriter

input_reader = PdfReader("input-2.pdf")
overlay_reader = PdfReader("output-2-fpdf-overlay.pdf")
output_writer = PdfWriter(clone_from=input_reader)

for i, input_page in enumerate(output_writer.pages):
    overlay_page = overlay_reader.pages[i]
    input_page.merge_page(page2=overlay_page)

output_writer.write("output-merged.pdf")

Please note that if you additionally might have to deal with rotated pages, you should add corresponding handling according to the docs.

stefan6419846 avatar Sep 11 '25 08:09 stefan6419846

I remember that we had a similar issue in our issue tracker some time ago with the similar side effect that the first page would have some strange influence on the second one - and when skipping the first page, everything would be correct.

Yes that is our case.

Is it a bug? Is my code supposed to work? Should I open an issue on pypdf?

the following alternative code (which effectively does the same as yours) generates correct results for me

Thank you! Finally I have something that works in production (I tried it with the full PDF that the input PDF was part of).

Please note that if you additionally might have to deal with rotated pages, you should add corresponding handling according to the docs.

It worked on landscape pages without that additional part

raffaem avatar Sep 11 '25 08:09 raffaem

Rotated pages are pages with page.rotation != 0. The rotation can instead be part of the page content stream itself, which does not have this issue. If everything works correctly for you, you most likely have page.rotation == 0 everywhere.
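
To make the distinction concrete: a landscape page can simply have a wide mediabox with rotation 0, or it can have a portrait mediabox plus a rotation of 90 or 270 degrees applied by the viewer. The following is a minimal sketch of that difference, using a hypothetical helper `displayed_size` that is not part of pypdf; with pypdf, the rotation value would come from page.rotation.

```python
def displayed_size(width, height, rotation):
    """Return (width, height) as a viewer shows a page with the given rotation."""
    if rotation % 90 != 0:
        raise ValueError("rotation must be a multiple of 90 degrees")
    if rotation % 180 == 90:
        # A quarter turn swaps the displayed width and height
        return height, width
    return width, height


# Landscape because the mediabox itself is wide; rotation is 0
landscape = displayed_size(842, 595, 0)

# Portrait mediabox shown as landscape only because of the rotation entry
rotated = displayed_size(595, 842, 90)

print(landscape, rotated)
```

Both pages look the same in a viewer, but only the second has a non-zero rotation, which is the case that needs the extra handling mentioned above.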

stefan6419846 avatar Sep 11 '25 08:09 stefan6419846

Is it a bug? Is my code supposed to work? Should I open an issue on pypdf?

If I find some time, I will try to investigate this first, so that I can open an issue with proper details.

stefan6419846 avatar Sep 11 '25 08:09 stefan6419846

Upstream issue: https://github.com/py-pdf/pypdf/issues/2260

The data is there, but for some reason we have an unencoded content stream that still specifies a filter, which consequently fails to render (and to be read) in pypdf.
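
The mismatch described here can be illustrated with plain zlib. This is a toy sketch, not pypdf's internals: the stream is modelled as a plain dict, the `decode_stream` helper is hypothetical, and the content bytes are made up. It assumes the declared filter is /FlateDecode, which is zlib-based.

```python
import zlib

# Made-up page content: a tiny text-drawing sequence
raw_content = b"BT /F1 9 Tf 10 10 Td (1 / 2) Tj ET"

# Well-formed stream: the filter is declared and the data really is compressed
good_stream = {"/Filter": "/FlateDecode", "data": zlib.compress(raw_content)}

# Mismatched stream: the filter is declared but the data was never compressed
bad_stream = {"/Filter": "/FlateDecode", "data": raw_content}


def decode_stream(stream):
    """Apply the declared filter to the stream data, as a reader would."""
    if stream.get("/Filter") == "/FlateDecode":
        return zlib.decompress(stream["data"])
    return stream["data"]


decoded_good = decode_stream(good_stream)

try:
    decode_stream(bad_stream)
    bad_decoded_ok = True
except zlib.error:
    # The raw bytes are not valid zlib data, so decoding fails
    bad_decoded_ok = False

print(decoded_good == raw_content, bad_decoded_ok)
```

The content is intact in both cases; only the second stream cannot be read, because the reader trusts the declared filter.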

stefan6419846 avatar Sep 11 '25 09:09 stefan6419846

assign this to me

BharathPESU avatar Nov 01 '25 11:11 BharathPESU

@BharathPESU No need to assign anyone here - nevertheless, you are of course invited to further investigate the corresponding issue and look for a suitable solution.

stefan6419846 avatar Nov 02 '25 19:11 stefan6419846