Ability to add text samples to existing database

Open TehMillhouse opened this issue 12 years ago • 5 comments

This would require the markov chain to defer calculation of the occurrence probability until during text generation, but should be quite doable.

Also, switching the _nextWord function over to doing integer math will do away with rounding errors and will improve performance. Yay!

TehMillhouse avatar Sep 09 '13 15:09 TehMillhouse

> This would require the markov chain to defer calculation of the occurrence probability until during text generation, but should be quite doable.

Not really. That would make the text generation slower, and you probably want to keep the slowest code-paths in the db generation code.

I suggest doing something a bit different: store word counts AND calculated probabilities in the db. Then, when adding more samples, simply add to those numbers and use the result to calculate the probabilities.

I think it'll be simpler to implement too.

elad661 avatar Nov 19 '13 23:11 elad661

Every time new information is added to the db, all probabilities shift as a result. If my current db (for any single word as last state) is {lol : 0.5, lol_wordcount : 3, haha : 0.5, haha_wordcount : 3}, and I encounter rofl, not only will rofl and rofl_wordcount be added, but all other probabilities will change too. So at that time, I either recalculate all probabilities, or defer this calculation until it's needed.
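To make the shift concrete, here is a minimal sketch of the example above. It assumes the db for one last-state word is reduced to a plain dict of follower counts; the helper name `probabilities` is made up for illustration and is not part of PyMarkovChain.

```python
from fractions import Fraction


def probabilities(counts):
    """Derive exact follower probabilities from raw counts (hypothetical helper)."""
    total = sum(counts.values())
    return {word: Fraction(n, total) for word, n in counts.items()}


# Follower counts for a single last-state word, as in the example above.
counts = {"lol": 3, "haha": 3}
before = probabilities(counts)   # lol and haha are each 3/6

# Encountering "rofl" once adds a new entry...
counts["rofl"] = counts.get("rofl", 0) + 1
after = probabilities(counts)    # ...and every existing probability becomes n/7

print(before)
print(after)
```

Only the counts need to be stored: the probabilities for this state can be rebuilt from them at any time, whether right after an update or lazily during generation.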

All I currently need for word selection (see _nextWord) is that all candidates have a corresponding value, the sum of which is the upper bound of my randomly selected sample. I don't ever really need floating point numbers for that, I've just used them in my implementation because it's easier to reason about probabilities when they fall within the mathematical definition as numbers in the interval [0,1].

~~P.S. I think I just spotted a mathematical flaw in word selection, I'll open an issue as soon as I'm sure of it.~~ Never mind.

TehMillhouse avatar Nov 20 '13 00:11 TehMillhouse

Just to be clear: if this is done the right way, db generation would be a mere matter of counting words, and _nextWord wouldn't change except for replacing the call to random.random() with random.randrange(self.wordcounttotals[lastword]) (where self.wordcounttotals[lastword] is the total number of times lastword was followed by another word).
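A rough sketch of what that could look like, assuming single-word states and whitespace tokenization. The class name `CountingChain`, the method `add_text`, and the `followers` attribute are hypothetical and not taken from PyMarkovChain; only `_nextWord` and `wordcounttotals` follow the names used in this thread.

```python
import random
from collections import defaultdict


class CountingChain:
    """Hypothetical counts-only chain; not PyMarkovChain's actual implementation."""

    def __init__(self):
        # followers[lastword][word] = how often `word` followed `lastword`
        self.followers = defaultdict(lambda: defaultdict(int))
        # wordcounttotals[lastword] = how often `lastword` was followed by any word
        self.wordcounttotals = defaultdict(int)

    def add_text(self, text):
        """Add a sample to the existing db; this only increments counters."""
        words = text.split()
        for lastword, word in zip(words, words[1:]):
            self.followers[lastword][word] += 1
            self.wordcounttotals[lastword] += 1

    def _nextWord(self, lastword):
        """Pick a follower with probability proportional to its count."""
        total = self.wordcounttotals.get(lastword, 0)
        if total == 0:
            return None  # lastword was never followed by anything
        sample = random.randrange(total)  # integer in [0, total)
        upto = 0
        for candidate, count in self.followers[lastword].items():
            upto += count
            if sample < upto:
                return candidate


chain = CountingChain()
chain.add_text("a b a c")
chain.add_text("a b")  # later samples simply add to the same counters
print(chain._nextWord("a"))  # "b" or "c", with "b" twice as likely
```

Because selection works directly on integer counts, nothing has to be recalculated when a sample is added, and no floating point rounding is involved.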

TehMillhouse avatar Nov 20 '13 00:11 TehMillhouse

Is there an update on this? It would be really useful for a project we're working on.

kach avatar Dec 31 '14 00:12 kach

Sorry about the late answer. I'm quite busy otherwise at the moment, so it's unlikely I'll get around to implementing this in the near future.

TehMillhouse avatar Jan 02 '15 18:01 TehMillhouse