MBSP
Lemmatization fails on CAPITALS
With a fresh download of MBSP 1.4 from GitHub today, normally cased input is lemmatized as expected:
>>> MBSP.lemmatize("The cats were sleeping.", tokenize=True)
u'the cat be sleep .'
When the whole sentence is in capital letters, only the first word is lemmatized:
>>> MBSP.lemmatize("CATS WERE SLEEPING.", tokenize=True)
u'cat WERE SLEEPING .'
Other parts of MBSP have the same problem -- first word works, the rest fail:
>>> MBSP.parse('EATING PIZZA WITH A FORK.', lemmata=True)
u'EATING/VBG/I-VP/O/VP-1/A1/eat PIZZA/NN/I-NP/O/NP-OBJ-1/O/PIZZA WITH/IN/I-PP/B-PNP/O/P1/WITH A/DT/I-NP/I-PNP/O/P1/A FORK/NNP/I-NP/I-PNP/O/P1/FORK ././O/O/O/O/.'
>>> string = "CATS ARE SLEEPING"
>>> MBSP.tag(string, tokenize=True, lemmata=True)
u'CATS/NNS/cat ARE/VBP/ARE SLEEPING/NN/SLEEPING'
Words with only an initial capital are handled correctly, even when they are not the first word of the sentence:
>>> MBSP.lemmatize("The Republicans were sleeping.", tokenize=True)
u'the Republican be sleep .'
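Until this is fixed, one possible workaround is to lowercase all-caps tokens before passing the text to MBSP. The following is only a sketch in plain Python: `lowercase_all_caps` is a hypothetical helper name, not part of MBSP, and I have not checked how MBSP treats the result internally.

```python
import re

def lowercase_all_caps(text):
    """Lowercase words written entirely in capitals; leave other words as-is."""
    def fix(match):
        word = match.group(0)
        # Only touch fully upper-case words of two or more letters, so that
        # "Republicans" and single letters such as "I" or "A" keep their casing.
        return word.lower() if len(word) > 1 and word.isupper() else word
    return re.sub(r"[A-Za-z]+", fix, text)

print(lowercase_all_caps("CATS WERE SLEEPING."))             # cats were sleeping.
print(lowercase_all_caps("The Republicans were sleeping."))  # The Republicans were sleeping.
```

The preprocessed string could then be handed to `MBSP.lemmatize(..., tokenize=True)`. Note the trade-off: acronyms such as "NASA" would be lowercased too, and single capital letters are left untouched.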
Cheers, David