PdfPig Using DuplicateOverlappingTextProcessor in HOcrTextExporter

I'm trying to work out how to use DuplicateOverlappingTextProcessor as part of the HOcrTextExporter process.

hocrTextExporter.Get() only accepts a Page as input; the word extractor and page segmenter are passed in through the constructor.

Whereas DuplicateOverlappingTextProcessor only returns a list of letters. There doesn't seem to be a defined way to get from one to the other.

I think the solution is to add an option to the word extractor options and use it like so:

        var ops = new NearestNeighbourWordExtractorOptions();
        ops.DeduplicateOverlappingText = true;
        var wordExtractor = new NearestNeighbourWordExtractor(ops);
        HOcrTextExporter hocrTextExporter = new HOcrTextExporter(wordExtractor, DocstrumBoundingBoxes.Instance);
        string hocrtext = hocrTextExporter.Get(page, useHocrjs: true);

Having had a look, I think it only needs the two changes below, both in one class. I'm not able to test the code at the moment, though.

UglyToad.PdfPig.DocumentLayoutAnalysis.WordExtractor NearestNeighbourWordExtractor.cs

    /// <summary>
    /// Get the words.
    /// </summary>
    /// <param name="letters">The page's letters to group into <see cref="Word"/>s.</param>
    /// <returns>The <see cref="Word"/>s generated by the nearest neighbour method.</returns>
    public IEnumerable<Word> GetWords(IReadOnlyList<Letter> letters)
    {
        if (letters == null || letters.Count == 0)
        {
            return Array.Empty<Word>();
        }

        // #### Change 1
        // Remove overlapping duplicates
        if (options.DeduplicateOverlappingText)
        {
            letters = DuplicateOverlappingTextProcessor.Get(letters);
        }

....

    /// <summary>
    /// Nearest neighbour word extractor options.
    /// </summary>
    public class NearestNeighbourWordExtractorOptions : IWordExtractorOptions
    {
        /// <summary>
        /// <inheritdoc/>
        /// Default value is -1.
        /// </summary>
        public int MaxDegreeOfParallelism { get; set; } = -1;

        // #### Change 2
        /// <summary>
        /// Whether to use <see cref="DuplicateOverlappingTextProcessor"/> to remove overlapping duplicate letters before grouping them into words.
        /// Default value is false.
        /// </summary>
        public bool DeduplicateOverlappingText { get; set; } = false;

Happy to hear if there's an existing alternative way of doing this.
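
One alternative that might work without changing the library is to wrap the word extractor, so the letters are deduplicated before they reach the real extractor. This is an untested sketch: DeduplicatingWordExtractor is a hypothetical name (not part of PdfPig), and it assumes HOcrTextExporter gets its words through the IWordExtractor passed to its constructor, and that DuplicateOverlappingTextProcessor.Get takes and returns the letter list the same way as in Change 1 above.

```csharp
using System;
using System.Collections.Generic;
using UglyToad.PdfPig.Content;
using UglyToad.PdfPig.DocumentLayoutAnalysis;
using UglyToad.PdfPig.Util;

/// <summary>
/// Hypothetical wrapper (not part of PdfPig): removes overlapping duplicate
/// letters, then hands the remaining letters to another word extractor.
/// </summary>
public sealed class DeduplicatingWordExtractor : IWordExtractor
{
    private readonly IWordExtractor inner;

    public DeduplicatingWordExtractor(IWordExtractor inner)
    {
        this.inner = inner ?? throw new ArgumentNullException(nameof(inner));
    }

    public IEnumerable<Word> GetWords(IReadOnlyList<Letter> letters)
    {
        // Nothing to deduplicate or group.
        if (letters == null || letters.Count == 0)
        {
            return Array.Empty<Word>();
        }

        // Assumption: same call as in Change 1 of the proposal.
        IReadOnlyList<Letter> deduplicated = DuplicateOverlappingTextProcessor.Get(letters);

        // Delegate the actual word grouping to the wrapped extractor.
        return inner.GetWords(deduplicated);
    }
}
```

If that holds, usage would be the same as the snippet above, with `new DeduplicatingWordExtractor(new NearestNeighbourWordExtractor())` passed to the HOcrTextExporter constructor in place of the plain word extractor. The upside is that it would work with any IWordExtractor, not only the nearest neighbour one.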

Jul 09 '24 13:07 KayTannee

@BobLd do you know if this would be possible? As I understand it, the change would be to feed the deduplicated letters into the word detection algorithm. Does a pre-processing step/pipeline exist for such a thing today?

Sep 29 '24 15:09 EliotJones

I believe this is possible, but I'd need to look into it. I'm not sure how easy it is.

Sep 29 '24 17:09 BobLd

Closing in line with #1095

Jul 20 '25 01:07 EliotJones
