Russian text is not broken down into sentences

Open 41exey opened this issue 4 years ago • 3 comments

The Russian text is not split into sentences. Please help me solve this problem.

test code

import stanfordnlp

text = "Это первое предложение. Это тестовая строка. Почему бы не написать что-нибудь еще."
pipeline = stanfordnlp.Pipeline(lang="ru")
doc = pipeline(text.lower())
print("The tokenizer split the input into {} sentences.".format(len(doc.sentences)))

result

The tokenizer split the input into 1 sentences.

41exey avatar Dec 07 '19 19:12 41exey

Thank you for the report! We can reproduce the problem and will try to figure it out soon.

yuhui-zh15 avatar Jan 11 '20 04:01 yuhui-zh15

@c1ewd It looks like the tokenizer overfit to casing information. If you don't call lower() on the text, the sentence split looks reasonable.
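
A minimal sketch of how one might check this, assuming a pipeline object that is callable on a string and returns a document with a `sentences` attribute (as in the test code above). The helper name `compare_casing` is hypothetical and not part of any library; the commented usage assumes the `stanza` package and its Russian model are installed.

```python
from typing import Any, Callable, Dict


def compare_casing(pipeline: Callable[[str], Any], text: str) -> Dict[str, int]:
    """Count sentences found in the original text and in its lowercased copy.

    `pipeline` is assumed to return an object exposing a `sentences` list.
    """
    return {
        "original": len(pipeline(text).sentences),
        "lowercased": len(pipeline(text.lower()).sentences),
    }


# Example usage (assumption: stanza and the Russian model are available):
#   import stanza
#   stanza.download("ru")
#   nlp = stanza.Pipeline(lang="ru", processors="tokenize")
#   text = "Это первое предложение. Это тестовая строка."
#   print(compare_casing(nlp, text))
```

If the two counts differ, the lowercasing step is the likely cause, and passing the text with its original casing is the simplest workaround.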

qipeng avatar Jan 22 '20 01:01 qipeng

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Dec 29 '20 18:12 stale[bot]

This is no longer an issue with the latest Russian model. I'm not sure when that changed, honestly.

AngledLuffa avatar Oct 03 '23 07:10 AngledLuffa