Different block extraction results between Windows (local) and Docker using DocstrumBoundingBoxes.Instance
I am experiencing inconsistent results when extracting text blocks from a PDF using DocstrumBoundingBoxes.Instance as the page segmenter. The issue occurs when running the same PDF processing code locally on Windows and in production on Docker.
Expected behavior
The block should be correctly extracted into five lines, as it is locally on Windows:
D*****************S
5 RUE P*****L
93200 SAINT-DENIS
REPRENTE PAR D******* I**********
SIRET : 989 288 774 00015
Actual behavior
When running inside Docker, the extracted blocks are incorrectly split and partially scrambled:
D*****************S
5 RUE P*****L
93200
SAINT-DENIS
REPRENTE PAR D******* I**********
SIRET : 288 00015
989 774
Environment
- UglyToad.PdfPig version: 0.1.11
- OS (local): Windows 11
- Docker base image : mcr.microsoft.com/dotnet/aspnet:9.0
- .NET runtime version: 9.0
Additional information
Both environments use the same code and dependencies, and the attached PDF reproduces the problem consistently. I tried both DefaultWordExtractor and NearestNeighbourWordExtractor. I suspect a difference in floating-point precision, font rendering, or locale behavior between the Windows and Linux environments.
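For triage, it may help to check whether the Docker run loses text or only regroups it. The sketch below assumes the block text from each environment has been saved to two plain-text files (the names `windows.txt` and `docker.txt` are hypothetical) and compares their whitespace-separated tokens regardless of order.

```shell
#!/bin/sh
# same_tokens FILE_A FILE_B
# Succeeds (exit 0) when both files contain the same whitespace-separated
# tokens, ignoring line breaks and token order.
same_tokens() {
    # One token per line, blank lines dropped, sorted byte-wise.
    a=$(tr -s '[:space:]' '\n' < "$1" | sed '/^$/d' | LC_ALL=C sort)
    b=$(tr -s '[:space:]' '\n' < "$2" | sed '/^$/d' | LC_ALL=C sort)
    [ "$a" = "$b" ]
}

# Usage (hypothetical file names):
#   same_tokens windows.txt docker.txt && echo "same words, different grouping"
```

For the two outputs listed above, the tokens are identical, which is consistent with a grouping difference rather than lost text.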
@Edouard-Tby thanks for opening the issue.
This is most certainly due to font differences across operating systems. This was also discussed here: https://github.com/UglyToad/PdfPig/issues/840
In my opinion, the solution would be to create a NuGet package that ships with fonts, so that results are consistent across operating systems (see the issue mentioned above).
I'll leave this one open for the moment and try to find time to work on that.
Agreed. I strongly recommend creating a separate NuGet package that includes the Liberation fonts. It would only increase the package size by about ~~20+MB~~ 4MB but would resolve most of the issues. This would allow users to choose between the two packages without affecting backward compatibility.
Thanks for your replies.
In the meantime, I added this to my Dockerfile, and it seems to have fixed the issue.
# 🧩 Install necessary TTF fonts (PdfPig will use them)
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
        fontconfig \
        fonts-dejavu \
        fonts-liberation \
        fonts-liberation2 && \
    fc-cache -fv && \
    # Verify Liberation fonts installed, fail build if not found
    fc-list | grep -i liberation || (echo "Liberation fonts not found!" && exit 1) && \
    rm -rf /var/lib/apt/lists/*
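To double-check a running container without relying on fontconfig, the font files can be counted directly on disk. This is only a sketch: the `count_fonts` helper is hypothetical, and it assumes `/usr/share/fonts` (where the Debian font packages install their files) is one of the directories PdfPig searches for system fonts.

```shell
#!/bin/sh
# count_fonts DIR PATTERN
# Prints the number of .ttf/.otf files under DIR whose file name contains
# PATTERN (case-insensitive). Prints 0 if DIR does not exist.
count_fonts() {
    find "$1" -type f \( -iname '*.ttf' -o -iname '*.otf' \) -iname "*$2*" 2>/dev/null \
        | wc -l | tr -d ' '
}

# Assumption: /usr/share/fonts is where the packages above put their files.
n=$(count_fonts /usr/share/fonts liberation)
echo "Liberation font files found: $n"
```

A count of 0 inside the container would suggest the packages were not installed into the image that is actually being run.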
Adding your solution to the wiki