PdfPig
Using DuplicateOverlappingTextProcessor in HOcrTextExporter
I'm trying to work out how to use DuplicateOverlappingTextProcessor as part of the HOcrTextExporter process.
hocrTextExporter.Get() only accepts a Page as input, with the word extractor and page segmenter supplied in the constructor,
whereas DuplicateOverlappingTextProcessor only returns a list of letters. There doesn't seem to be a defined way to get from one to the other.
I think the solution is to add an option to the word extractor options and use it like so:
var ops = new NearestNeighbourWordExtractorOptions();
ops.DeduplicateOverlappingText = true;
var wordExtractor = new NearestNeighbourWordExtractor(ops);
HOcrTextExporter hocrTextExporter = new HOcrTextExporter(wordExtractor, DocstrumBoundingBoxes.Instance);
string hocrtext = hocrTextExporter.Get(page, useHocrjs: true);
Having had a look, I think only the two changes below are needed, both in the one class. I'm not able to test the code at the moment, though.
UglyToad.PdfPig.DocumentLayoutAnalysis.WordExtractor NearestNeighbourWordExtractor.cs
/// <summary>
/// Get the words.
/// </summary>
/// <param name="letters">The page's letters to group into <see cref="Word"/>s.</param>
/// <returns>The <see cref="Word"/>s generated by the nearest neighbour method.</returns>
public IEnumerable<Word> GetWords(IReadOnlyList<Letter> letters)
{
    if (letters == null || letters.Count == 0)
    {
        return Array.Empty<Word>();
    }

    // #### Change 1
    // Remove overlapping duplicates
    if (options.DeduplicateOverlappingText)
    {
        letters = DuplicateOverlappingTextProcessor.Get(letters);
    }
    ....
/// <summary>
/// Nearest neighbour word extractor options.
/// </summary>
public class NearestNeighbourWordExtractorOptions : IWordExtractorOptions
{
    /// <summary>
    /// <inheritdoc/>
    /// Default value is -1.
    /// </summary>
    public int MaxDegreeOfParallelism { get; set; } = -1;

    // #### Change 2
    /// <summary>
    /// Uses DuplicateOverlappingTextProcessor to remove overlapping letters before GetWords.
    /// Default value is false.
    /// </summary>
    public bool DeduplicateOverlappingText { get; set; } = false;
Is there an existing alternative way of doing this? I'm happy to use that instead.
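One alternative that might work without changing the library is to wrap the word extractor in a small decorator that deduplicates the letters before delegating. The sketch below is untested. It assumes that HOcrTextExporter passes the page's letters to the GetWords method of whichever IWordExtractor it was given. DeduplicatingWordExtractor is a hypothetical name, not part of PdfPig.

```csharp
using System;
using System.Collections.Generic;
using UglyToad.PdfPig.Content;
using UglyToad.PdfPig.DocumentLayoutAnalysis;
using UglyToad.PdfPig.Util;

/// <summary>
/// Hypothetical wrapper (not part of PdfPig): removes overlapping duplicate
/// letters, then hands the remaining letters to the wrapped word extractor.
/// </summary>
public sealed class DeduplicatingWordExtractor : IWordExtractor
{
    private readonly IWordExtractor inner;

    public DeduplicatingWordExtractor(IWordExtractor inner)
    {
        this.inner = inner ?? throw new ArgumentNullException(nameof(inner));
    }

    public IEnumerable<Word> GetWords(IReadOnlyList<Letter> letters)
    {
        if (letters == null || letters.Count == 0)
        {
            return Array.Empty<Word>();
        }

        // Same call as in "Change 1" above, applied outside the extractor.
        IReadOnlyList<Letter> deduplicated = DuplicateOverlappingTextProcessor.Get(letters);

        return inner.GetWords(deduplicated);
    }
}
```

It would then be passed to the exporter in place of the plain extractor, for example new HOcrTextExporter(new DeduplicatingWordExtractor(NearestNeighbourWordExtractor.Instance), DocstrumBoundingBoxes.Instance). This keeps the deduplication opt-in and works with any IWordExtractor, but whether the exporter picks it up depends on the assumption above.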
@BobLd do you know if this would be possible? As I understand it, the change would be to feed the deduplicated letters into the word detection algorithm. Does a pre-processing step or pipeline exist for such a thing today?
I believe this is possible, but I'd need to look into it; I'm not sure how easy it is.
Closing in line with #1095