
Apostrophe is placed at wrong location in word since 0.1.12

Open HertBp opened this issue 1 month ago • 2 comments

This worked fine in 0.1.11, but since 0.1.12 parsing text that contains words with apostrophes goes wrong in some cases. See the attached sample.

apostrophe.pdf

the line: Alarcão, J. & Etienne, R. (1977), Fouilles de Conimbriga I, L’architecture .Paris.

is parsed as: Alarcão, J. & Etienne, R. (1977), Fouilles de Conimbriga I, Larchitectur’ e.Paris.

which is wrong.

I use this code (works fine in 0.1.11) to get the words:

NearestNeighbourWordExtractor wordExtractor = new NearestNeighbourWordExtractor(new NearestNeighbourWordExtractor.NearestNeighbourWordExtractorOptions()
{
    /* Letter pivot, Letter candidate */
    Filter = (pivot, candidate) =>
    {
        // check if white space (default implementation of 'Filter')
        if (string.IsNullOrWhiteSpace(candidate.Value))
        {
            // pivot and candidate letters cannot belong to the same word 
            // if candidate letter is null or white space.
            // ('FilterPivot' already checks if the pivot is null or white space by default)
            return false;
        }
        // check for height difference
        double maxHeight = Math.Max(pivot.PointSize, candidate.PointSize);
        double minHeight = Math.Min(pivot.PointSize, candidate.PointSize);
        if (minHeight != 0 && maxHeight / minHeight > 2.0)
        {
            // pivot and candidate letters cannot belong to the same word 
            // if one letter is more than twice the size of the other.
            return false;
        }

        try
        {
            // check for colour difference
            if (!pivot.Color.Equals(candidate.Color))
            {
                // pivot and candidate letters cannot belong to the same word 
                // if they don't have the same colour.
                return false;
            }
        }
        catch
        {
            //bug in the library, temporary workaround TODO remove this
        }

        return true;
    },

    MaximumDistance = (l1, l2) =>
    {
        List<string> exceptions = new List<string> {"V", "T", "/", "’" };
        double maxDist = Math.Max(Math.Max(Math.Max(Math.Max(Math.Max(
            Math.Abs(l1.GlyphRectangle.Width),
            Math.Abs(l2.GlyphRectangle.Width)),
            Math.Abs(l1.Width)),
            Math.Abs(l2.Width)),
            l1.PointSize), l2.PointSize) * 0.16;

        if(exceptions.Contains(l1.Value))
        {
            maxDist *= 1.2;
        }

        if (l1.TextOrientation == TextOrientation.Other || l2.TextOrientation == TextOrientation.Other)
        {
            return 2.0 * maxDist;
        }
        return maxDist;
    }
});

List<Word> words = new List<Word>(wordExtractor.GetWords(letters));
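
The snippet above does not show where `letters` comes from. A minimal sketch of a self-contained repro, assuming the letters are taken straight from `page.Letters` of the attached sample, could look like this. The page number and the class name `ApostropheRepro` are assumptions for illustration, and the default `NearestNeighbourWordExtractor.Instance` stands in for the customised extractor purely to keep the sketch short:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;
using UglyToad.PdfPig.DocumentLayoutAnalysis.WordExtractor;

public static class ApostropheRepro
{
    public static void Main()
    {
        // "apostrophe.pdf" is the attached sample; page 1 is an assumption.
        using (PdfDocument document = PdfDocument.Open("apostrophe.pdf"))
        {
            Page page = document.GetPage(1);
            IReadOnlyList<Letter> letters = page.Letters;

            // Default extractor used for brevity (assumption);
            // swap in the customised wordExtractor from the snippet above.
            List<Word> words = NearestNeighbourWordExtractor.Instance
                .GetWords(letters)
                .ToList();

            // Print only the words that contain the typographic apostrophe,
            // so the misplaced character is easy to spot.
            foreach (Word word in words.Where(w => w.Text.Contains("’")))
            {
                Console.WriteLine(word.Text);
            }
        }
    }
}
```

Running the same program against 0.1.11 and 0.1.12 and comparing the printed words should make it easier to compare the two versions on a single page.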

HertBp, Nov 28 '25 07:11

@HertBp Using PdfPig 0.1.11 on Windows and your code, I'm getting the wrong output: La’rchitecture

On which OS are you running your code?

BobLd, Nov 29 '25 09:11

I run on Windows 11. OK, so I rechecked this and it is also wrong for me on 0.1.11, which is strange, because when I run the whole document through it the output is correct. So 0.1.11 is not 100% reliable either, but it is still more reliable than 0.1.12. I am now looking into this in more detail (I have 234 documents in a regression test) and see that in some cases 0.1.12 does better, but in most it goes wrong: the apostrophe is moved to the back of the word.

In this example the output should be correct for 0.1.11 and wrong for 0.1.12 (the word L’Ambiguïté, about halfway down the page):

apostrophe.pdf

HertBp, Dec 01 '25 06:12