
where is the math?

Open randomgambit opened this issue 5 years ago • 6 comments

Hi there, and thanks for this beautiful package! I was curious to understand more about the inner workings of the package. Is there a PDF that explains exactly how you use Markov chains to create a model (with some equations)?

thanks!

randomgambit avatar Jul 26 '20 20:07 randomgambit

Thanks for the kind words, @randomgambit! I'm not a trained mathematician, so I am not the most authoritative person to answer this, but I believe that this chapter of Grinstead and Snell provides a helpful overview: https://www.dartmouth.edu/~chance/teaching_aids/books_articles/probability_book/Chapter11.pdf

Does that answer your question?

jsvine avatar Aug 02 '20 18:08 jsvine

Thank you! Yes, I know how Markov chains work, but I am not sure how you implement them in the package. What is the transition matrix, how many states are there, etc.? Can you please share more details?

thanks!

randomgambit avatar Aug 02 '20 18:08 randomgambit

Hi @randomgambit. The state size defaults to 2 but is configurable. The transition matrix is calculated in markovify.chain.Chain.

jsvine avatar Aug 02 '20 18:08 jsvine

Got it, but it is a bit hard to understand what is going on just by looking at the code. I think it would immensely help if you could write a small example of how the algorithm works (by using, say, a corpus of just two simple sentences).

randomgambit avatar Aug 02 '20 18:08 randomgambit

Would something like this satisfy your curiosity?:

For text corpora, Markovify begins the process of building its Markov models by splitting each corpus into a series of sentences, and each sentence into a series of "tokens" (i.e., words, plus placeholder tokens for the beginning and end of a sentence).

Then, walking sentence-by-sentence, it identifies every sequence of state_size (default: 2) tokens, and counts how many times each other token comes immediately afterward. This dictionary of corpus[(token_a, token_b, ...)][next_token]: count frequencies comprises the "transition matrix" of the resulting Markov model.

Thus, "Janice walked to the park. After that, Janice walked to the zoo." becomes:

{
 ('___BEGIN__', '___BEGIN__'): {'After': 1, 'Janice': 1},
 ('___BEGIN__', 'After'): {'that,': 1},
 ('___BEGIN__', 'Janice'): {'walked': 1},
 ('After', 'that,'): {'Janice': 1},
 ('Janice', 'walked'): {'to': 2},
 ('that,', 'Janice'): {'walked': 1},
 ('the', 'park.'): {'___END__': 1},
 ('the', 'zoo.'): {'___END__': 1},
 ('to', 'the'): {'park.': 1, 'zoo.': 1},
 ('walked', 'to'): {'the': 2}
}
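The counting step above can be sketched in a few lines of plain Python. This is a hypothetical re-implementation for illustration only; markovify's actual code lives in markovify.chain.Chain:

```python
from collections import defaultdict

BEGIN, END = "___BEGIN__", "___END__"
STATE_SIZE = 2

sentences = [
    "Janice walked to the park.".split(),
    "After that, Janice walked to the zoo.".split(),
]

# model[state_tuple][next_token] -> count
model = defaultdict(lambda: defaultdict(int))
for words in sentences:
    # Pad with BEGIN placeholders and a single END placeholder.
    items = [BEGIN] * STATE_SIZE + words + [END]
    for i in range(len(words) + 1):
        state = tuple(items[i : i + STATE_SIZE])
        follow = items[i + STATE_SIZE]
        model[state][follow] += 1

print(dict(model[("Janice", "walked")]))  # {'to': 2}
print(dict(model[("to", "the")]))         # {'park.': 1, 'zoo.': 1}
```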

When Markovify attempts to generate a new sentence, it randomly chooses each successive token using these frequency-based probabilities.
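That generation step can be sketched as a weighted random walk over the dictionary above; the real implementation is in markovify.chain.Chain, but the idea is roughly:

```python
import random

BEGIN, END = "___BEGIN__", "___END__"

# The transition counts from the two-sentence example above.
model = {
    (BEGIN, BEGIN): {"After": 1, "Janice": 1},
    (BEGIN, "After"): {"that,": 1},
    (BEGIN, "Janice"): {"walked": 1},
    ("After", "that,"): {"Janice": 1},
    ("Janice", "walked"): {"to": 2},
    ("that,", "Janice"): {"walked": 1},
    ("the", "park."): {END: 1},
    ("the", "zoo."): {END: 1},
    ("to", "the"): {"park.": 1, "zoo.": 1},
    ("walked", "to"): {"the": 2},
}

def generate(model, state_size=2):
    # Start from the all-BEGIN state and sample each next token
    # in proportion to its observed count, until END is drawn.
    state = (BEGIN,) * state_size
    words = []
    while True:
        choices = model[state]
        (nxt,) = random.choices(list(choices), weights=list(choices.values()))
        if nxt == END:
            return " ".join(words)
        words.append(nxt)
        state = state[1:] + (nxt,)

print(generate(model))
```

With this tiny corpus the walk can only produce four sentences, e.g. "After that, Janice walked to the park." — including two "novel" ones, since the (to, the) state can lead to either ending.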

jsvine avatar Aug 05 '20 03:08 jsvine

Super interesting! Thanks! I really think you should include this example on the main page.

I am just trying to figure out the exact size of the transition matrix in the example above... can we write it down, or are there too many rows/columns?

Thanks!

randomgambit avatar Aug 05 '20 03:08 randomgambit