stanza Russian text is not broken down into sentences

Open 41exey opened this issue 4 years ago • 3 comments

The Russian text is not split into sentences. Please help me solve this problem.

test code

import stanfordnlp

text = "Это первое предложение. Это тестовая строка. Почему бы не написать что-нибудь еще."
pipeline = stanfordnlp.Pipeline(lang="ru")
doc = pipeline(text.lower())
print("The tokenizer split the input into {} sentences.".format(len(doc.sentences)))

result

The tokenizer split the input into 1 sentences.

Dec 07 '19 19:12 41exey

Thank you for your information! The problem can be reproduced. We will try to figure it out soon.

Jan 11 '20 04:01 yuhui-zh15

@c1ewd Looks like the tokenizer overfit to casing information. If you don't call lower() on the text, the sentence split actually looks reasonable.

Jan 22 '20 01:01 qipeng
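
The casing dependence described in the comment above can be illustrated with a toy example. This is only a sketch and not the stanfordnlp/stanza tokenizer, which is a learned model: the regex rule and the helper name split_sentences are assumptions made up for illustration. The rule breaks after sentence-final punctuation only when the next word starts with an uppercase letter, so lowercasing the input removes every boundary cue it relies on.

```python
import re

# Toy rule (hypothetical, NOT the stanza model): split on whitespace that
# follows ., ! or ? and precedes an uppercase Latin or Cyrillic letter.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-ZА-ЯЁ])")


def split_sentences(text):
    """Return the non-empty pieces produced by the casing-based rule."""
    return [piece for piece in _BOUNDARY.split(text) if piece]


text = "Это первое предложение. Это тестовая строка. Почему бы не написать что-нибудь еще."

# Original casing keeps the uppercase cue after each period.
print(len(split_sentences(text)))          # 3

# Lowercasing removes the cue, so the text stays in one piece.
print(len(split_sentences(text.lower())))  # 1
```

In the reported test code, the equivalent change is to pass text to the pipeline directly instead of text.lower().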

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Dec 29 '20 18:12 stale[bot]

This is no longer an issue with the latest Russian model. Not sure when that changed, honestly.

Oct 03 '23 07:10 AngledLuffa
