markovify
markovify's make_sentence_with_start() doesn't seem to work properly
heya @jsvine. i'm writing some fairly simple code with markovify, and i keep running into a couple of issues.
- make_sentence_with_start doesn't find sentences containing the given words when they're clearly there, even with strict=False
- for some reason, when my generated prompt is exactly two words long, it gives me a KeyError: ('word_a', 'word_b'). it works as expected in some cases, but in my tests the issues happen a lot more often. i can give you the code if you need it.
Hi @nezetimesthree, and thanks for your interest in markovify. When you get a chance, please provide code and text that reproduces the problem. Without that, it will unfortunately be quite hard to debug.
of course. here's the code and text file.
from transformers import pipeline
import random
import markovify

model_link = "IProject-10/bert-base-uncased-finetuned-squad2"
question_answerer = pipeline("question-answering", model=model_link)

with open('mayakovsky.txt', 'r') as file:
    f = file.readlines()

poems = []
poem = ''
dataset = ''
for line in f:
    dataset += line.strip() + '. '
    if line != '\n':
        poem += line.strip() + ' '
    else:
        poems.append(poem)
        poem = ''

context = random.choice(poems)
question = input()
answer = question_answerer(question=question, context=context)['answer']
print(answer, '->', ' '.join(answer.split()[-2:]))

text_model = markovify.Text(' '.join(poems))
if len(answer.split()) > 1:
    print(text_model.make_sentence_with_start(' '.join(answer.split()[-2:]), strict=False, tries=100), end='\n')
else:
    print(text_model.make_sentence_with_start(answer, strict=False, tries=100), end='\n')
for i in range(5):
    print(text_model.make_short_sentence(200, min_length=100, tries=100), end='\n')
Thanks for sharing this, @nezetimesthree.
It seems that you're passing make_sentence_with_start a "start" that was generated by an LLM, which is not guaranteed to be a start that actually exists in your corpus. That is a requirement for markovify and for this type of Markov chain generally. Is that correct? If so, this is expected behavior of markovify and I would not consider it a bug.
If I've misunderstood, could you share a simpler code example that doesn't depend on other libraries, yet still reproduces the problem? In this example, the logic that uses IProject-10/bert-base-uncased-finetuned-squad2 is fairly intertwined here with the logic that uses markovify, and there are several different calls to markovify, making it difficult to debug.
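To illustrate why the start has to exist in the corpus, here is a minimal standard-library sketch of a two-word-state Markov chain. It is a simplified stand-in, not markovify's actual implementation; `build_model` and the `BEGIN`/`END` markers are hypothetical names used only for this example. The point is that the model is keyed by runs of adjacent words, so a pair whose words both occur in the corpus can still be missing as a state.

```python
from collections import defaultdict

# Hypothetical sentinel tokens for this sketch (not taken from markovify).
BEGIN, END = "<BEGIN>", "<END>"


def build_model(sentences, state_size=2):
    """Map each run of `state_size` adjacent words to the words seen right after it."""
    model = defaultdict(list)
    for sentence in sentences:
        words = [BEGIN] * state_size + sentence.split() + [END]
        for i in range(len(words) - state_size):
            state = tuple(words[i:i + state_size])
            model[state].append(words[i + state_size])
    # Return a plain dict so that looking up an unseen state raises KeyError.
    return dict(model)


model = build_model(["the cat sat down", "the cat ran off"])

print(model[("the", "cat")])     # ['sat', 'ran']
print(("cat", "down") in model)  # False: both words occur, but never side by side
```

In a structure like this, asking for the followers of `("cat", "down")` raises a `KeyError`, which resembles the error reported above for two-word starts.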
thanks for taking a look, @jsvine. but you're misunderstanding this: the LLM gives answers only from the given context, which, in this case, is one of the poems from the file. i've checked the failing cases against the poem dataset, and the words were always there. for some reason, NewlineText didn't see them as a start for sentences. maybe it's because some of the lines consist of only one word? could this be the issue?
Thank you for the helpful clarification, @nezetimesthree. Could you share a start that the code fails on but that is definitely a start in the corpus?
hello again, @jsvine. sorry i didn't answer yesterday, but here's the example, the error, and the proof that it's clearly there.
Thanks; can you share that as copy-pasteable text?
addition: here's what happens when it receives only one word
can you clarify what you mean by "copy-pasteable text", though? if i understand you correctly, then the words are "ладно слажен" and "Наоборот"
Great, thanks; that's what I was looking for, indeed.
Thanks again for the helpful example. Taking a closer look, the issue seems not to be with make_sentence_with_start, but rather with the sentence parser much earlier in the processing pipeline.
import markovify

with open("mayakovsky.txt", "r") as file:
    model = markovify.Text(file.read())

def test_presence(fragment):
    return any(
        any(fragment == token for token in sentence)
        for sentence in model.parsed_sentences
    )

print(test_presence("Послушайте!"))
print(test_presence("слажен"))
Prints:
True
False
The default Text model uses a regex-powered filter to remove sentences that could cause problems, mostly regarding apostrophes and quotation marks. It also invokes unidecode, which seems to be causing the problem here. Because it's a generally useful approach, I don't want to remove that step from the library, but there are two ways you should be able to handle this on your end:
- Calling markovify.Text(..., well_formed=False), which skips the filtering step
- Extending markovify.Text (documented here) to behave in a way better suited to your corpus.
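As a rough illustration of how a filtering step can make words "disappear", here is a standard-library sketch. The `REJECT` pattern and the `keep_sentence` helper are hypothetical stand-ins, not the library's actual pattern or method. What matters is the effect: a rejected sentence is dropped whole, so none of its words reach the model.

```python
import re

# Hypothetical reject pattern for this sketch: double quotes, parentheses, brackets.
REJECT = re.compile(r"[\"()\[\]]")


def keep_sentence(sentence):
    """Keep a sentence only if it is non-empty and contains no rejected character."""
    return bool(sentence.strip()) and REJECT.search(sentence) is None


sentences = ["He said (quietly) hello", "Plain words here"]
kept = [s for s in sentences if keep_sentence(s)]

# Vocabulary that a model built from the kept sentences would see.
vocab = {word for s in kept for word in s.split()}

print("hello" in vocab)  # False: its sentence was filtered out entirely
print("Plain" in vocab)  # True
```

Skipping the filter, as the first option does, keeps every sentence, at the cost of letting through whatever punctuation the filter was guarding against.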
Using well_formed=False seems to work well, although you'll have to contend with the punctuation (or strip it out in a pre-processing step), as you'll see with the comma below:
import markovify

with open("mayakovsky.txt", "r") as file:
    model = markovify.Text(file.read(), well_formed=False)

print(model.make_sentence_with_start("ладно слажен,"))
Prints: ладно слажен, — и все обвыл.
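If you'd rather strip the punctuation out first, the pre-processing step mentioned above could look something like the sketch below. It uses only the standard library, and `strip_punctuation` and `KEEP` are hypothetical names. It assumes you want to keep `.`, `!` and `?` so that sentence splitting still has something to work with; adjust that set to suit your corpus.

```python
import unicodedata

# Sentence terminators left in place (an assumption of this sketch).
KEEP = set(".!?")


def strip_punctuation(text):
    """Remove Unicode punctuation except the characters in KEEP, line by line."""
    cleaned = "".join(
        ch for ch in text
        if ch in KEEP or not unicodedata.category(ch).startswith("P")
    )
    # Collapse the whitespace runs left behind by removed dashes and commas,
    # while preserving line breaks.
    return "\n".join(" ".join(line.split()) for line in cleaned.splitlines())


print(strip_punctuation("ладно слажен, — и все обвыл."))
# ладно слажен и все обвыл.
```

You would run the file contents through a function like this before passing them to the model, after which a start such as "ладно слажен" no longer needs the trailing comma.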
thank you very much, @jsvine. i will test it and return with the result next week. sorry for making you wait for it, but i just won't have a chance this week. thank you again, and we'll see if this works.