stanza
Russian text is not broken down into sentences
The Russian text is not split into sentences. Please help me solve this problem.
test code
import stanfordnlp

text = "Это первое предложение. Это тестовая строка. Почему бы не написать что-нибудь еще."
pipeline = stanfordnlp.Pipeline(lang="ru")
# Note: the input is lowercased before it reaches the tokenizer
doc = pipeline(text.lower())
print("The tokenizer split the input into {} sentences.".format(len(doc.sentences)))
result
The tokenizer split the input into 1 sentences.
Thank you for the report! The problem can be reproduced. We will try to figure it out soon.
@c1ewd Looks like the tokenizer overfit to casing information. If you don't lower() the text, the sentence split actually looks reasonable.
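To check the casing effect on your own input, something like the sketch below may help. It counts sentences for the text as written and for its lowercased form. The helper names (`count_sentences`, `compare_casing`, `run_with_stanza`) are made up for illustration, and `run_with_stanza` assumes the current `stanza` package with a downloadable Russian model rather than the older `stanfordnlp` package used in the report.

```python
def count_sentences(nlp, text):
    """Return how many sentences the pipeline `nlp` finds in `text`.

    `nlp` is any callable that returns an object with a `.sentences` list.
    """
    return len(nlp(text).sentences)


def compare_casing(nlp, text):
    """Count sentences for the text as written and for its lowercased form."""
    return {
        "original": count_sentences(nlp, text),
        "lowercased": count_sentences(nlp, text.lower()),
    }


def run_with_stanza(text):
    """Run the comparison with a Russian tokenizer pipeline.

    Assumption: `stanza` is installed and the Russian model can be downloaded.
    """
    import stanza

    stanza.download("ru")
    nlp = stanza.Pipeline(lang="ru", processors="tokenize")
    return compare_casing(nlp, text)
```

Calling `run_with_stanza(text)` with the sentence from the report returns both counts side by side. If the "original" count looks right and the "lowercased" one does not, the tokenizer is probably relying on capitalization, and the simplest workaround is to tokenize first and lowercase afterwards.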
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This is no longer an issue with the latest Russian model. Not sure when that changed, honestly.