Lemmatization
May be unnecessary for Release 0.1
Hello, are you still planning to create a node for Lemmatization, given that it's already provided in CoreNLPFeatureExtractor? (Same question for NER and POS Tagging.)
Hi there,
There are a couple of weaknesses with our current use of CoreNLP:
- Performance: While CoreNLP is pretty quick, it does take some time to initialize and given the library's structure it makes sense to batch as many analyses as you can into a single pass over a document. This is the strategy we take in CoreNLPFeatureExtractor.
- Licensing: CoreNLP is licensed under GPLv3. We are linking against it and not selling KeystoneML as proprietary software, so this is fine for us, but it may not be acceptable for all of our users. As a result, I'd like to limit reliance on CoreNLP going forward.
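To illustrate the performance point above, here is a minimal Python sketch of the "initialize once, annotate everything in one pass" pattern. `ToyPipeline`, `get_pipeline`, and `extract_features` are hypothetical names made up for this example; they are not CoreNLP's or KeystoneML's API, and the "annotations" are trivial stand-ins for real ones such as POS tags or lemmas.

```python
from functools import lru_cache


class ToyPipeline:
    """Stand-in for an NLP pipeline that is expensive to construct (hypothetical)."""

    instances = 0  # counts how many times the setup cost was paid

    def __init__(self):
        ToyPipeline.instances += 1

    def annotate(self, text):
        # A single pass over the document produces every annotation at once,
        # instead of re-tokenizing the text for each separate analysis.
        tokens = text.split()
        return {
            "tokens": tokens,
            "lower": [t.lower() for t in tokens],
            "is_capitalized": [t[:1].isupper() for t in tokens],
        }


@lru_cache(maxsize=None)
def get_pipeline():
    # Build the pipeline lazily and reuse the same instance afterwards.
    return ToyPipeline()


def extract_features(docs):
    pipeline = get_pipeline()
    return [pipeline.annotate(doc) for doc in docs]
```

Calling `extract_features` repeatedly reuses the cached pipeline, so the construction cost is paid only once per process.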
I'm unfamiliar with the current state-of-the-art in lemmatization, but if there's a JVM-based implementation of standard techniques that is both 1) business-friendly in licensing (Apache or BSD preferred), and 2) reasonably high performance, I'd be interested in seeing it integrated with KeystoneML.
Alternatively, if you want to take a shot at implementing something like this as the first step in more extensive NLP support, we'd welcome such a PR.
Hey Evans,
I took a quick look for any interesting JVM library and found Epic (https://github.com/dlwh/epic), which is written in Scala and is under the Apache License, Version 2.0.
As far as I can see from the repo, you are already using Breeze from ScalaNLP, and Epic is a subproject of ScalaNLP.
They already have NER and a POSTagger implemented, and to implement a Lemmatizer we'd need the POSTagger anyway, so we could build on that.
I'll take a deeper look over the next few days to check whether the implementation is near state of the art, if you think it'd be interesting.
Cheers
The ScalaNLP stuff is great - and comes out of David Hall/Dan Klein's work, so I expect it to be quite modern. Kind of a bummer that it doesn't include lemmatization out of the box. It would be good to get a sense of how it performs vs. CoreNLP (in terms of both statistical accuracy and throughput).
That said - for basic lemmatization we might consider taking a page from the Python nltk playbook and using WordNet. JWI (http://projects.csail.mit.edu/jwi/) provides an alternative on the JVM. It is licensed CC-BY 4.0.
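For reference, here is a rough Python sketch of how WordNet-style lemmatization works: look the word up in an exception list, then try suffix-detachment rules and keep the first candidate found in the dictionary. The rules follow WordNet's documented noun and verb detachment rules, but `LEXICON` and `EXCEPTIONS` are tiny hypothetical stand-ins for the real WordNet data, and `lemmatize` is an illustrative name rather than the nltk or JWI API. Note that it needs the part of speech as input, which is why a POS tagger comes first.

```python
# Suffix-detachment rules per part of speech: (suffix, replacement).
RULES = {
    "n": [("s", ""), ("ses", "s"), ("xes", "x"), ("zes", "z"),
          ("ches", "ch"), ("shes", "sh"), ("men", "man"), ("ies", "y")],
    "v": [("s", ""), ("ies", "y"), ("es", "e"), ("es", ""),
          ("ed", "e"), ("ed", ""), ("ing", "e"), ("ing", "")],
}

# Hypothetical miniature dictionary; real WordNet has many thousands of entries.
LEXICON = {
    "n": {"dog", "box", "city", "woman"},
    "v": {"run", "make", "walk", "study"},
}

# Irregular forms that no suffix rule can handle (again, a tiny sample).
EXCEPTIONS = {
    "n": {"mice": "mouse", "geese": "goose"},
    "v": {"ran": "run", "went": "go"},
}


def lemmatize(word, pos="n"):
    """Return the lemma of `word` for the given POS, or the word itself if unknown."""
    word = word.lower()
    if word in EXCEPTIONS[pos]:
        return EXCEPTIONS[pos][word]
    if word in LEXICON[pos]:
        return word
    for suffix, replacement in RULES[pos]:
        if word.endswith(suffix):
            candidate = word[:len(word) - len(suffix)] + replacement
            # Only accept candidates that are real dictionary entries.
            if candidate in LEXICON[pos]:
                return candidate
    return word


print(lemmatize("boxes", "n"), lemmatize("making", "v"), lemmatize("mice", "n"))
```

Because every candidate is checked against the dictionary, the quality of this approach depends almost entirely on dictionary coverage; words outside it are returned unchanged.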