ru_punkt

Handling ellipsis

Open · dzhelonkin opened this issue 7 years ago · 1 comment

Hi! Thank you for your contribution to the nltk project. Your model handles Russian punctuation much better than the other nltk models, but there is an issue with the ellipsis (...). Examples:

>>> import nltk
>>> sent_tokenize = nltk.data.load('tokenizers/punkt/russian.pickle')
>>> sent_tokenize.tokenize("Мама мыла раму… Папа мыл кларнет...")
['Мама мыла раму… Папа мыл кларнет...']
>>> sent_tokenize.tokenize("Мама мыла раму... Папа мыл кларнет...")
['Мама мыла раму... Папа мыл кларнет...']
>>> sent_tokenize.tokenize("Мама мыла раму!!! Папа мыл кларнет...")
['Мама мыла раму!!!', 'Папа мыл кларнет...']
>>> sent_tokenize.tokenize("Мама мыла раму!.. Папа мыл кларнет...")
['Мама мыла раму!..', 'Папа мыл кларнет...']

Does it work as designed (ex. 1 and ex. 2)? In Russian, an ellipsis usually marks the end of a sentence, but maybe I am wrong.

dzhelonkin, Aug 21 '18 09:08

Hi,

I doubt we can fix that easily. One could probably split sentences at an ellipsis by taking into account the case of the first letter of the word that follows it. I need to take a deeper look at how this could be done with the PunktSentenceTokenizer class.

Mottl, Aug 25 '18 17:08
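
For anyone who needs a workaround in the meantime, here is a minimal sketch of the heuristic suggested above: split at an ellipsis only when the next word starts with an uppercase letter. It does not touch PunktSentenceTokenizer itself. It is a plain regex post-processing step applied to the tokenizer's output, and the names `split_on_ellipsis` and `_ELLIPSIS_BOUNDARY` are made up for this example, not part of nltk or ru_punkt.

```python
import re

# Hypothetical helper, not part of nltk or ru_punkt.
# Matches whitespace that comes right after "..." or "…" and right before
# an uppercase Latin or Cyrillic letter. Two separate lookbehinds are used
# because Python's re module requires each lookbehind to be fixed-width.
_ELLIPSIS_BOUNDARY = re.compile(r"(?:(?<=\.\.\.)|(?<=…))\s+(?=[A-ZА-ЯЁ])")


def split_on_ellipsis(sentences):
    """Further split already-tokenized sentences at ellipsis boundaries."""
    result = []
    for sentence in sentences:
        # re.split drops the matched whitespace and keeps the ellipsis
        # attached to the sentence it ends.
        result.extend(_ELLIPSIS_BOUNDARY.split(sentence))
    return result


# Input shaped like the tokenizer output from examples 1 and 2.
print(split_on_ellipsis(["Мама мыла раму… Папа мыл кларнет..."]))
print(split_on_ellipsis(["Мама мыла раму... Папа мыл кларнет..."]))
# Lowercase continuation after the ellipsis: left as one sentence.
print(split_on_ellipsis(["Он подумал... и ушёл."]))
```

In practice one would pass it the list returned by `sent_tokenize.tokenize(text)`. The obvious weakness is a mid-sentence ellipsis followed by a proper noun, which this sketch would split incorrectly, so treat it as a stopgap rather than a fix to the model.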