markovify subclassing markovify.Text to allow for different types of 'sentences'

hi and thx for yr great library.

i made a cli program to run it on my own texts.

i'm trying to add a subclass to it that lets me feed it sentences that don't begin with initial capital letters and might begin with stars, bullets, etc. i made a subclass (modeled on your NewlineText) to modify the regexes in split_into_sentences(): i changed the lookahead that mandates an initial capital letter after a sentence end (splitters.py, line 45) to read r"\s+(?=[-•\w‘’“”'*\|/~\",])", and added a few more punctuation marks to the preceding regexes (hyphen, ellipses/triple periods).
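
To illustrate the lookahead change described above, here is a minimal sketch using only the standard library. Both patterns are simplified stand-ins written for illustration (an assumption, not markovify's actual regexes): `capital_only` splits only when the next character is a capital letter, while `relaxed` uses the lookahead quoted above.

```python
import re

text = "who among us not embalmed. walk up to wall and kick it, once, twice, there."

# Stand-in for a splitter that insists on a capital letter after the break.
capital_only = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

# Stand-in using the relaxed lookahead from the post: the next character may be
# a word character or one of several kinds of initial punctuation.
relaxed = re.compile(r"(?<=[.!?])\s+(?=[-•\w‘’“”'*\|/~\",])")

print(capital_only.split(text))  # stays in one piece: "walk" is lowercase
print(relaxed.split(text))       # breaks after "embalmed."
```

With lowercase input like this, the capital-letter version never finds a break, which is the behaviour the subclass is meant to change.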

it works if i manually generate a corpus and markov model from one of my texts, but not if i run my program using the subclass. one "sentence" will have a period in the middle of it and will continue printing text after it.

so i wanted to ask: is there anything in the way sentences are made from the markov model that would affect these modified regexes or disregard them? and is there a better way to go about modifying sentence endings than messing with split_into_sentences()?

[sorry if it's obvious in the code. i'm very much a novice with programming.]

Oct 29 '20 17:10 mooseyboots

Hi @mooseyboots, and thanks for your interest in this library. I'm having a bit of trouble, however, understanding the specifics of your inquiry. Could you provide some code, inputs, and outputs that demonstrate the issue?

Oct 30 '20 02:10 jsvine

here is my subclass modifying split_into_sentences():

import re
import markovify
from markovify.splitters import is_sentence_ender


class NoInitCaps(markovify.Text):
    """
    An attempt to subclass markovify.Text to allow for sentences to not begin with an initial capital letter.
    """

    def split_into_sentences(self, text):
        potential_end_pat = re.compile(
            r"".join(
                [
                    r"([-\w\.\"'’~”&\]\)]+[…(\.){1,4}\?!])",  # A word that ends with punctuation, including ellipsis, possibly separated by white space
                    r"([‘’“”'~\"\)\]]*)",  # Followed by optional quote/parens/etc
                    r"\s+(?=[-•\w‘’“”'*\|/~\",])",  # followed by whitespace. then a lookahead to the next char, which can be alphanumeric or initial punctuation
                ]
            ),
            re.U,  # U for Unicode!
        )
        dot_iter = re.finditer(potential_end_pat, text)
        end_indices = [
            (x.start() + len(x.group(1)) + len(x.group(2)))
            for x in dot_iter
            if is_sentence_ender(x.group(1))
        ]
        spans = zip([None] + end_indices, end_indices + [None])
        sentences = [text[start:end].strip() for start, end in spans]
        return sentences

    def sentence_split(self, text):
        return self.split_into_sentences(text)

a selection of input from one of my files:

    • error, which makes things swollen, gives them that look of filling out just a little more space than is theirs, so that they bump into other swollen things, seek room.

renege.

‘empty’ words, imagine!

who among us not embalmed.
walk up to wall and kick it, once, twice, there.

, lying in wait / for the neonate.

1 incorrect 'sentence' from the sample output using my subclass:

who among us not embalmed. walk up to the rhythm of beer and coffee, on the verge of nothing here.

so the word "embalmed." is not counting as an end.

but what confused me is that if i use the subclass manually to generate a corpus, such as something like:

  from mkv_this.noinitcaps import NoInitCaps

  text = "/PATH/TO/INPUT/scrapbook.txt"

  # read the input and build the corpus with the subclass
  with open(text, "r") as t:
      txt = t.read()
      text_obj = NoInitCaps(txt)  # my subclass
      corpus = text_obj.generate_corpus(txt)
      clist = list(corpus)

  # write the list of word lists out for inspection
  with open("/PATH/TO/OUTPUT/markov-corpus-no-init-caps.txt", "w") as c:
      c.write(str(clist))

the word "embalmed." will actually be the last item in its sentence's list:

 ['renege.'], ['‘empty’', 'words,', 'imagine!'], ['who', 'among', 'us', 'not', 'embalmed.'], ['walk', 'up', 'to', 'wall', 'and', 'kick', 'it,', 'once,', 'twice,', 'there.'], [',', 'lying', 'in', 'wait', '/', 'for', 'the', 'neonate.']

which to me suggested that the regex sentence splitter was working correctly.
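
One way to make that check systematic is to scan the corpus for tokens that end in sentence-final punctuation but are not the last token of their run. This is a minimal sketch under stated assumptions: `interior_enders` is a hypothetical helper (not part of markovify), and the two small corpora are hand-written from the excerpt above rather than generated by the library.

```python
def interior_enders(corpus, enders=".!?…"):
    """Return (run_index, token_index, token) for every token that ends in
    sentence-final punctuation but is not the last token of its run."""
    hits = []
    for i, run in enumerate(corpus):
        for j, token in enumerate(run[:-1]):  # skip the final token of each run
            if token and token[-1] in enders:
                hits.append((i, j, token))
    return hits


# Runs as they appear in the corpus excerpt: each ends at its punctuation.
corpus = [
    ["renege."],
    ["‘empty’", "words,", "imagine!"],
    ["who", "among", "us", "not", "embalmed."],
    ["walk", "up", "to", "wall", "and", "kick", "it,", "once,", "twice,", "there."],
]

# What a run would look like if the split after "embalmed." had been missed.
unsplit = [["who", "among", "us", "not", "embalmed.", "walk", "up", "to", "wall"]]

print(interior_enders(corpus))   # []
print(interior_enders(unsplit))  # [(0, 4, 'embalmed.')]
```

An empty result for the corpus that the program actually builds would suggest the splitter is being applied there too; any hits would point to runs that were built with a different splitter.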

my query is, if i want to change how markovify understands what constitutes the end of a sentence, is that all i need to do or are there other things to modify?

Oct 30 '20 08:10 mooseyboots

Hi @mooseyboots, and thanks for the additional details. Judging from the sample of the corpus you shared, which seems to place each sentence on a new line, the easiest solution may just be to use the already-defined markovify.NewlineText class.

And if that doesn't quite fit your use-case, you can use that subclass's definition as a perhaps-simpler starting place (i.e., swapping out the regular expression below for the regular expression of your choosing):

https://github.com/jsvine/markovify/blob/16b936790b01edf403a7b0deb468f4096d0cd292/markovify/text.py#L287-L293

Nov 03 '20 16:11 jsvine
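
A minimal sketch of the suggestion above, assuming a subclass that overrides `sentence_split` (the same hook `NewlineText` overrides) with a different regular expression. The names `LOOSE_SPLIT`, `split_loose`, and `LooseText` are hypothetical, and the pattern is only one possible choice: it breaks on newlines, or on whitespace that follows `.`, `!`, `?`, or `…`, and makes no attempt to handle abbreviations. The import is guarded so the splitter can be tried on its own without markovify installed.

```python
import re

try:
    import markovify
except ImportError:  # lets the splitter below run without markovify installed
    markovify = None

# Hypothetical pattern: break on newlines, or on whitespace that follows
# sentence-final punctuation, whatever character comes next.
LOOSE_SPLIT = re.compile(r"\s*\n\s*|(?<=[.!?…])\s+")


def split_loose(text):
    """Split text into 'sentences' without requiring an initial capital."""
    return [s.strip() for s in LOOSE_SPLIT.split(text) if s.strip()]


if markovify is not None:

    class LooseText(markovify.Text):
        # Same hook that NewlineText overrides, with a different pattern.
        def sentence_split(self, text):
            return split_loose(text)


sample = "who among us not embalmed.\nwalk up to wall. kick it!\n\n, lying in wait."
print(split_loose(sample))
```

If markovify is available, `LooseText(txt).make_sentence()` would then build and query a model from sentences split this way. Keeping the pattern in a plain function makes it easy to test the splitting separately from the model.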
