Apostrophe is placed at wrong location in word since 0.1.12
This worked fine in 0.1.11, but since 0.1.12 words containing apostrophes are sometimes parsed incorrectly. See the attached sample.
the line: Alarcão, J. & Etienne, R. (1977), Fouilles de Conimbriga I, L’architecture .Paris.
is parsed as: Alarcão, J. & Etienne, R. (1977), Fouilles de Conimbriga I, Larchitectur’ e.Paris.
which is wrong.
I use this code (works fine in 0.1.11) to get the words:
NearestNeighbourWordExtractor wordExtractor = new NearestNeighbourWordExtractor(new NearestNeighbourWordExtractor.NearestNeighbourWordExtractorOptions()
{
    /* Letter pivot, Letter candidate */
    Filter = (pivot, candidate) =>
    {
        // check if white space (default implementation of 'Filter')
        if (string.IsNullOrWhiteSpace(candidate.Value))
        {
            // pivot and candidate letters cannot belong to the same word
            // if candidate letter is null or white space.
            // ('FilterPivot' already checks if the pivot is null or white space by default)
            return false;
        }

        // check for height difference
        double maxHeight = Math.Max(pivot.PointSize, candidate.PointSize);
        double minHeight = Math.Min(pivot.PointSize, candidate.PointSize);
        if (minHeight != 0 && maxHeight / minHeight > 2.0)
        {
            // pivot and candidate letters cannot belong to the same word
            // if one letter is more than twice the size of the other.
            return false;
        }

        try
        {
            // check for colour difference
            if (!pivot.Color.Equals(candidate.Color))
            {
                // pivot and candidate letters cannot belong to the same word
                // if they don't have the same colour.
                return false;
            }
        }
        catch
        {
            // bug in the library, temporary workaround. TODO: remove this
        }

        return true;
    },
    MaximumDistance = (l1, l2) =>
    {
        // letters that get a slightly larger distance threshold when they are the first letter (l1)
        List<string> exceptions = new List<string> { "V", "T", "/", "’" };

        // 16% of the largest of: both glyph widths, both letter widths, both point sizes
        double maxDist = Math.Max(Math.Max(Math.Max(Math.Max(Math.Max(
            Math.Abs(l1.GlyphRectangle.Width),
            Math.Abs(l2.GlyphRectangle.Width)),
            Math.Abs(l1.Width)),
            Math.Abs(l2.Width)),
            l1.PointSize), l2.PointSize) * 0.16;

        if (exceptions.Contains(l1.Value))
        {
            maxDist *= 1.2;
        }

        if (l1.TextOrientation == TextOrientation.Other || l2.TextOrientation == TextOrientation.Other)
        {
            return 2.0 * maxDist;
        }

        return maxDist;
    }
});
List<Word> words = new List<Word>(wordExtractor.GetWords(letters));
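To make the distance threshold easier to reason about, here is a minimal sketch of the same arithmetic as the MaximumDistance delegate above, using plain numbers instead of Letter properties. The class name, method name, and parameter names are made up for illustration and are not part of PdfPig; the sample values in the usage comment are invented as well.

```csharp
using System;

// Hypothetical helper, not part of PdfPig: mirrors the MaximumDistance
// lambda from the snippet above with plain doubles.
public static class MaxDistanceSketch
{
    public static double MaxDistance(
        double glyphWidth1, double glyphWidth2,
        double width1, double width2,
        double pointSize1, double pointSize2,
        bool firstLetterIsException,   // l1.Value is one of "V", "T", "/", "’"
        bool otherOrientation)         // either letter has TextOrientation.Other
    {
        // Largest of both glyph widths, both letter widths and both point sizes.
        double largest = Math.Max(Math.Max(Math.Max(Math.Max(Math.Max(
            Math.Abs(glyphWidth1),
            Math.Abs(glyphWidth2)),
            Math.Abs(width1)),
            Math.Abs(width2)),
            pointSize1), pointSize2);

        // Base threshold is 16% of that value.
        double maxDist = largest * 0.16;

        // Only the first letter is checked against the exception list.
        if (firstLetterIsException)
        {
            maxDist *= 1.2;
        }

        return otherOrientation ? 2.0 * maxDist : maxDist;
    }
}

// Usage (invented values: a narrow glyph next to a wide one, both at 10pt):
//   MaxDistanceSketch.MaxDistance(1.5, 5.0, 2.0, 5.5, 10.0, 10.0, false, false)
//   MaxDistanceSketch.MaxDistance(1.5, 5.0, 2.0, 5.5, 10.0, 10.0, true, false)
```

With these invented values the point size dominates, so the threshold is roughly 1.6 in the first call and roughly 1.92 in the second. Note that the 1.2 factor applies only when the apostrophe is the first letter (l1), not when it is the second letter (l2).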
@HertBp Using PdfPig 0.1.11 on Windows and your code, I'm getting the wrong output: La’rchitecture
On which OS are you running your code?
I run on Windows 11. OK, I rechecked this and it is also wrong for me on 0.1.11, which is strange, because when I run the whole document through it the result is correct. So 0.1.11 does not look 100% reliable either, but it is still more reliable than 0.1.12. I am now looking into this in more detail (I have 234 documents in a regression test) and see that in some cases 0.1.12 does better, but in most cases it goes wrong: the apostrophe is moved to the back of the word.
In this example the result should be correct for 0.1.11 and wrong for 0.1.12 (the word L’Ambiguïté, about halfway down the page):