pragmatic_segmenter
Text Chunking
Pragmatic_Segmenter should be able to return sentence segments up to a maximum size.
E.g.
https://github.com/akalsey/textchunk
https://github.com/algolia/chunk-text
The following code example was donated by https://auditus.cc, courtesy of @Havenwood.
It is used to ensure that conversion requests stay within the AWS Polly limit
of 1500 characters per request.
optimized_sentences = sentences.each_with_object([]) do |sentence, accumulator| # like reduce, but returns the accumulator
  if accumulator.last && (accumulator.last.size + sentence.size + 2) < 1500
    # The current chunk still has room: append the sentence plus a separator.
    accumulator.last << sentence << ' '
  else
    # Start a new chunk (a new string, so the input sentence is not mutated).
    accumulator << (sentence + ' ')
  end
end
Given an array of sentences, the above code counts the characters in each sentence and concatenates consecutive sentences into one chunk as long as their combined length stays below an arbitrary limit (in this case 1500 characters, the AWS Polly limit).
https://gist.github.com/havenwood/e9c286c524f2de5649586e7d28fec7af
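As a quick sanity check, the merging step can be exercised on synthetic sentences (the 700-character strings below are illustrative, not from the original gist):

```ruby
sentences = ['a' * 700, 'b' * 700, 'c' * 700]

optimized_sentences = sentences.each_with_object([]) do |sentence, accumulator|
  if accumulator.last && (accumulator.last.size + sentence.size + 2) < 1500
    accumulator.last << sentence << ' '
  else
    accumulator << (sentence + ' ')
  end
end

# The first two sentences fit in one chunk; the third starts a new one.
optimized_sentences.map(&:size) # => [1402, 701]
```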
The above code, however, does not handle the case where a single sentence itself exceeds 1500 characters. Here is one possible method (also donated by https://auditus.cc):
stripped_sentences = optimized_sentences.flat_map do |sentence|
  if sentence.size > 1500
    # Split on whitespace into pieces of at most 1500 characters.
    # A single whitespace-free token of 1500+ characters is kept whole.
    # scan returns one-element arrays here (one capture group), which
    # flat_map flattens back into strings.
    sentence.scan(/(\S{1500,}|.{1,1500})(?:\s|$)/)
  else
    sentence
  end
end
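To see the splitting step in action, one can feed the same regex a single over-long "sentence" (the 2999-character example below is synthetic). Outside of flat_map, the one-element arrays returned by scan need an explicit flatten:

```ruby
long_sentence = (['word'] * 600).join(' ') # 2999 characters, over the 1500 limit

chunks = long_sentence.scan(/(\S{1500,}|.{1,1500})(?:\s|$)/).flatten

chunks.all? { |chunk| chunk.size <= 1500 } # => true
chunks.join(' ') == long_sentence          # => true
```

Note that a single whitespace-free token longer than 1500 characters would be matched whole by the `\S{1500,}` branch and would still exceed the Polly limit; such input would need further handling.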