Different block extraction results between Windows (local) and Docker using DocstrumBoundingBoxes.Instance

Open Edouard-Tby opened this issue 2 months ago • 4 comments

I am experiencing inconsistent results when extracting text blocks from a PDF using DocstrumBoundingBoxes.Instance as the page segmenter. The issue occurs when running the same PDF processing code locally on Windows and in production on Docker.

Expected behavior

The block should be correctly extracted into five lines, as it is locally on Windows:

D*****************S
5 RUE P*****L
93200 SAINT-DENIS
REPRENTE PAR D******* I**********
SIRET : 989 288 774 00015

Actual behavior

When running inside Docker, the extracted blocks are incorrectly split and partially scrambled:

D*****************S
5 RUE P*****L
93200
SAINT-DENIS
REPRENTE PAR D******* I**********
SIRET : 288 00015
989 774

Environment

  • UglyToad version: 0.1.11
  • OS (local): Windows 11
  • Docker base image : mcr.microsoft.com/dotnet/aspnet:9.0
  • .NET runtime version: 9.0

Additional information

Both environments use the same code and dependencies, and the attached PDF reproduces the problem reliably. I tried both DefaultWordExtractor and NearestNeighbourWordExtractor. This suggests a possible difference in floating-point precision, font rendering, or locale behavior between the Windows and Linux environments.

EXAMPLE (2) (1).pdf

Edouard-Tby avatar Nov 05 '25 09:11 Edouard-Tby

@Edouard-Tby thanks for opening the issue.

This is most certainly due to font differences across operating systems. This was also discussed here: https://github.com/UglyToad/PdfPig/issues/840

In my opinion, the solution would be to create a NuGet package that ships with fonts, so that results are consistent across operating systems (see the issue mentioned above).

I'll leave this one open for the moment and try to find time to work on that.

BobLd avatar Nov 10 '25 10:11 BobLd

Agreed. I strongly recommend creating a separate NuGet package that includes the Liberation fonts. It would only increase the package size by about ~~20+MB~~ 4MB but would resolve most of the issues. This would allow users to choose between the two packages without affecting backward compatibility.

lihuanglx avatar Nov 12 '25 09:11 lihuanglx

Thanks for your replies.

In the meantime, I added this to my Dockerfile, and it seems to have fixed the issue.


# 🧩 Install necessary TTF fonts (PdfPig will use them)
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
    fontconfig \
    fonts-dejavu \
    fonts-liberation \
    fonts-liberation2 && \
    fc-cache -fv && \
    # Verify Liberation fonts installed, fail build if not found
    fc-list | grep -i liberation || (echo "Liberation fonts not found!" && exit 1) && \
    rm -rf /var/lib/apt/lists/*
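
The `grep` check in the Dockerfile above only confirms that some Liberation font is present. To compare what two environments actually expose, a per-family check can help. The following is a minimal sketch, not part of the original Dockerfile: `check_families` is a hypothetical helper name, and the family names in the usage comment are only examples. It assumes `fontconfig` is installed so that `fc-list` is available.

```shell
#!/bin/sh
# Hypothetical helper: report which expected font families appear in a
# font listing. Reads fc-list-style lines on stdin; the family names to
# look for are passed as arguments. Returns 1 if any family is missing.
check_families() {
    listing=$(cat)
    missing=0
    for family in "$@"; do
        # Case-insensitive, fixed-string match against the listing.
        if printf '%s\n' "$listing" | grep -qiF -- "$family"; then
            echo "ok: $family"
        else
            echo "missing: $family"
            missing=1
        fi
    done
    return "$missing"
}

# Example usage inside the container (family names are illustrative):
#   fc-list : family | check_families "Liberation Sans" "DejaVu Sans"
```

Running the same check in each Linux environment shows whether the families you expect are present, which is one way to confirm that a layout difference comes from missing fonts rather than from the code.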

Edouard-Tby avatar Nov 12 '25 14:11 Edouard-Tby

Adding your solution to the wiki

BobLd avatar Nov 13 '25 08:11 BobLd