ru_punkt
Handling ellipsis
Hi! Thank you for your contribution to the nltk project. Your model handles Russian punctuation much better than other nltk models, but there is an issue with the ellipsis (...). Examples:
>>> import nltk
>>> sent_tokenize = nltk.data.load('tokenizers/punkt/russian.pickle')
>>> sent_tokenize.tokenize("Мама мыла раму… Папа мыл кларнет...")
['Мама мыла раму… Папа мыл кларнет...']
>>> sent_tokenize.tokenize("Мама мыла раму... Папа мыл кларнет...")
['Мама мыла раму... Папа мыл кларнет...']
>>> sent_tokenize.tokenize("Мама мыла раму!!! Папа мыл кларнет...")
['Мама мыла раму!!!', 'Папа мыл кларнет...']
>>> sent_tokenize.tokenize("Мама мыла раму!.. Папа мыл кларнет...")
['Мама мыла раму!..', 'Папа мыл кларнет...']
Does it work as designed (examples 1 and 2)? An ellipsis in Russian usually marks the end of a sentence, but maybe I am wrong.
Hi,
I doubt we can fix that easily. One could probably split sentences at an ellipsis by taking into account the case of the first letter of the word that follows it. I need to take a deeper look at how this could be done with the PunktSentenceTokenizer class.
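As a rough illustration of the case-based idea, here is a minimal sketch of a post-processing step that runs on the tokenizer's output rather than inside PunktSentenceTokenizer. The helper names `split_on_ellipsis` and `refine_sentences` are hypothetical, not part of nltk, and the rule is deliberately naive: it will also split before a capitalized proper noun that follows a mid-sentence ellipsis.

```python
import re

# An ellipsis ("..." or the single character "…"), then whitespace (group 1),
# then some non-space character that we inspect for case.
_ELLIPSIS_THEN_SPACE = re.compile(r"(?:\.{3}|…)(\s+)(?=\S)")


def split_on_ellipsis(sentence):
    """Split one string at every ellipsis followed by an uppercase letter.

    Hypothetical helper: an assumption-based sketch, not nltk API.
    """
    parts = []
    start = 0
    for match in _ELLIPSIS_THEN_SPACE.finditer(sentence):
        # match.end() is the index of the first character after the whitespace.
        if sentence[match.end()].isupper():
            # Keep the ellipsis with the left part, drop the whitespace.
            parts.append(sentence[start:match.start(1)])
            start = match.end()
    parts.append(sentence[start:])
    return parts


def refine_sentences(sentences):
    """Apply split_on_ellipsis to each sentence and flatten the result."""
    refined = []
    for sentence in sentences:
        refined.extend(split_on_ellipsis(sentence))
    return refined


# Feeding in the unsplit output from examples 1 and 2 above:
print(refine_sentences(["Мама мыла раму… Папа мыл кларнет..."]))
print(refine_sentences(["Мама мыла раму... Папа мыл кларнет..."]))
```

In practice this would wrap the existing call, e.g. `refine_sentences(sent_tokenize.tokenize(text))`. Whether the same rule can be expressed inside the Punkt model itself is still an open question.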