Lemmatization fails on CAPITALS

Open: Liontooth opened this issue 9 years ago • 0 comments

With a fresh download of MBSP 1.4 from GitHub today, lemmatization works on ordinary mixed-case input:

>>> MBSP.lemmatize("The cats were sleeping.", tokenize=True) 
u'the cat be sleep .'

When the input is in all capitals, only the first word is lemmatized:

>>> MBSP.lemmatize("CATS WERE SLEEPING.", tokenize=True)
u'cat WERE SLEEPING .'

Other parts of MBSP have the same problem: the first word gets the correct lemma, and the rest are returned unchanged:

>>> MBSP.parse('EATING PIZZA WITH A FORK.', lemmata=True)
u'EATING/VBG/I-VP/O/VP-1/A1/eat PIZZA/NN/I-NP/O/NP-OBJ-1/O/PIZZA WITH/IN/I-PP/B-PNP/O/P1/WITH A/DT/I-NP/I-PNP/O/P1/A FORK/NNP/I-NP/I-PNP/O/P1/FORK ././O/O/O/O/.'

>>> MBSP.tag(string, tokenize=True, lemmata=True)
u'CATS/NNS/cat ARE/VBP/ARE SLEEPING/NN/SLEEPING'

Words with only an initial capital are handled correctly, even when they are not the first word of the sentence:

>>> MBSP.lemmatize("The Republicans were sleeping.", tokenize=True)
u'the Republican be sleep .'
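
Since words with only an initial capital are lemmatized correctly, one possible workaround until this is fixed is to lowercase all-caps words before passing the text to MBSP. The following is only a minimal sketch: `lowercase_all_caps` is a hypothetical helper, not part of MBSP, and I have not verified what MBSP returns for its output.

```python
import re

def lowercase_all_caps(text):
    """Lowercase words written entirely in capitals; leave other words alone.

    Hypothetical pre-processing helper, not part of MBSP.
    Caveat: acronyms such as "NATO" are lowercased too.
    """
    # If every cased character is upper case, lowercase the whole string,
    # including one-letter words such as "A".
    if text.isupper():
        return text.lower()
    # Otherwise lowercase only fully capitalized words of two or more letters,
    # so "Republicans" and a sentence-initial "The" are left unchanged.
    return re.sub(r"\b[A-Z]{2,}\b", lambda m: m.group(0).lower(), text)

print(lowercase_all_caps("CATS WERE SLEEPING."))
print(lowercase_all_caps("The Republicans were sleeping."))
```

The result would then be passed on as usual, e.g. `MBSP.lemmatize(lowercase_all_caps(text), tokenize=True)`. This only hides the problem on the caller's side; the lemmatizer itself still needs a fix.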

Cheers, David

Liontooth, Aug 09 '14 13:08