
Text Chunking

Open Immortalin opened this issue 6 years ago • 0 comments

Pragmatic_Segmenter should be able to return chunks of sentences up to a maximum size, e.g. https://github.com/akalsey/textchunk https://github.com/algolia/chunk-text

The following code example is donated by https://auditus.cc courtesy of @Havenwood. It is used to ensure that conversion requests stay within the limits of AWS Polly (1500-character limit).

      optimized_sentences = sentences.each_with_object([]) do |sentence, accumulator| # like reduce, but returns the accumulator
        if accumulator.last && (accumulator.last.size + sentence.size + 1) < 1500
          accumulator.last << ' ' << sentence # fits: append to the current chunk
        else
          accumulator << sentence.dup # start a new chunk (dup avoids mutating the input strings)
        end
      end

Given an array of sentences, the code above counts the characters in each sentence and concatenates consecutive sentences as long as their combined length stays below an arbitrary limit (in this case 1500, the Amazon Polly character limit).

https://gist.github.com/havenwood/e9c286c524f2de5649586e7d28fec7af
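For illustration, the accumulation step can be wrapped in a helper method. The name `chunk_sentences` and the `limit:` keyword are hypothetical, not part of pragmatic_segmenter's API; a smaller limit is used here so the behavior is easy to see:

```ruby
# Illustrative wrapper around the accumulation logic.
# `chunk_sentences` is a hypothetical name, not a pragmatic_segmenter method.
def chunk_sentences(sentences, limit: 1500)
  sentences.each_with_object([]) do |sentence, chunks|
    if chunks.last && chunks.last.size + sentence.size + 1 <= limit
      chunks.last << ' ' << sentence # fits: append to the current chunk
    else
      chunks << sentence.dup         # start a new chunk
    end
  end
end

chunks = chunk_sentences(['One.', 'Two.', 'Three.'], limit: 12)
# Each chunk stays within the limit: ["One. Two.", "Three."]
```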

The code above, however, does not handle sentences whose length already exceeds 1500 characters. Here is one possible method for that case (also donated by https://auditus.cc):

      stripped_sentences = optimized_sentences.flat_map do |sentence|
        if sentence.size > 1500
          # Split on whitespace into pieces of at most 1500 characters.
          # A single "word" of 1500+ non-space characters is kept whole
          # rather than broken mid-word (so such pieces can still exceed
          # the limit). scan returns one-element capture arrays, which
          # flat_map flattens back into plain strings.
          sentence.scan(/(\S{1500,}|.{1,1500})(?:\s|$)/)
        else
          sentence
        end
      end
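To make the splitting behavior concrete, the same regex idea can be tried with a small limit. The helper name `split_long` is illustrative only; the regex prefers breaking at whitespace, so pieces end on word boundaries where possible:

```ruby
# Illustrative splitter for over-long sentences, using the same regex
# pattern as above but with a parameterized limit.
def split_long(sentence, limit)
  return [sentence] if sentence.size <= limit
  # scan returns one-element capture arrays; flatten to plain strings
  sentence.scan(/(\S{#{limit},}|.{1,#{limit}})(?:\s|$)/).flatten
end

split_long('aaa bbb ccc ddd', 7)
# → ["aaa bbb", "ccc ddd"]
```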

Immortalin avatar Apr 09 '18 06:04 Immortalin