Lemmatization

Open etrain opened this issue 9 years ago • 4 comments

May be unnecessary for Release 0.1

etrain, Apr 13 '15 20:04

Hello, are you still planning to create a node for lemmatization, given that it's already provided by CoreNLPFeatureExtractor? (Same question for NER and POS tagging.)

ngarneau, Apr 21 '16 22:04

Hi there,

There are a couple of weaknesses with our current use of CoreNLP:

  1. Performance: While CoreNLP is pretty quick, it does take some time to initialize, and given the library's structure it makes sense to batch as many analyses as you can into a single pass over a document. This is the strategy we take in CoreNLPFeatureExtractor.
  2. Licensing: CoreNLP is licensed GPLv3 - we are linking against it and not selling KeystoneML as proprietary software, so this is fine, but it may not be acceptable for all of our users. As a result, I'd like to limit reliance on CoreNLP going forward.
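
To make the performance point concrete, here is a rough sketch of the "initialize once, annotate everything in one pass" pattern. It is written in Python purely for illustration; `ExpensivePipeline`, `get_pipeline`, and `extract_features` are hypothetical names, and none of this is the CoreNLP or KeystoneML API.

```python
from functools import lru_cache
from typing import Dict, List


class ExpensivePipeline:
    """Hypothetical stand-in for a heavyweight NLP pipeline (not CoreNLP)."""

    instances_created = 0  # counts how many times the costly setup ran

    def __init__(self) -> None:
        # A real pipeline would load its models here, which is the slow part.
        ExpensivePipeline.instances_created += 1

    def annotate(self, text: str) -> Dict[str, list]:
        # A single pass over the document produces every analysis at once,
        # instead of re-tokenizing the text for each feature separately.
        tokens = text.split()
        return {
            "tokens": tokens,
            "lower": [t.lower() for t in tokens],
            "capitalized": [t[:1].isupper() for t in tokens],
        }


@lru_cache(maxsize=None)
def get_pipeline() -> ExpensivePipeline:
    # Built on first use, then reused for every later call in this process.
    return ExpensivePipeline()


def extract_features(docs: List[str]) -> List[Dict[str, list]]:
    pipeline = get_pipeline()
    return [pipeline.annotate(doc) for doc in docs]
```

Calling `extract_features` repeatedly reuses the same pipeline object, so the setup cost is paid once rather than once per document or per analysis.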

I'm unfamiliar with the current state-of-the-art in lemmatization, but if there's a JVM-based implementation of standard techniques that is both 1) business-friendly in licensing (Apache or BSD preferred), and 2) reasonably high performance, I'd be interested in seeing it integrated with KeystoneML.

Alternatively, if you want to take a shot at implementing something like this as the first step in more extensive NLP support, we'd welcome such a PR.

etrain, Apr 21 '16 22:04

Hey Evans,

I took a quick look for JVM libraries that might be of interest and found Epic (https://github.com/dlwh/epic), which is written in Scala and released under the Apache License, Version 2.0.

As far as I can see from the repo, you are already using Breeze from ScalaNLP, and Epic is a subproject of ScalaNLP.

They already have NER and a POS tagger implemented, and to implement a lemmatizer we'd need the POS tagger anyway, so we could build on that.

If you think it'd be interesting, I'll take a deeper look over the next few days to see whether the implementation is near the state of the art.

Cheers

ngarneau, Apr 22 '16 15:04

The ScalaNLP stuff is great - and comes out of David Hall/Dan Klein's work, so I expect it to be quite modern. Kind of a bummer that it doesn't include lemmatization out of the box. It would be good to get a sense of how it performs vs. CoreNLP (both from a statistical and throughput perspective).

That said - for basic lemmatization we might consider taking a page from the Python nltk playbook and using WordNet. JWI (http://projects.csail.mit.edu/jwi/), a Java interface to WordNet, provides an alternative. It is licensed CC-BY 4.0.
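
For reference, the WordNet approach boils down to three steps: check a per-POS list of irregular forms, then try suffix detachment rules, and keep a candidate only if it is a known word for that part of speech. Below is a minimal Python sketch of that idea. The detachment rules follow WordNet's documented morphy rules, but `LEXICON` and `EXCEPTIONS` are tiny made-up stand-ins for the real WordNet index and exception files, `lemmatize` is a hypothetical helper, and this is not the nltk or JWI API.

```python
from typing import Dict, List, Set, Tuple

# Suffix detachment rules in the style of WordNet's morphy, keyed by POS.
DETACHMENT_RULES: Dict[str, List[Tuple[str, str]]] = {
    "n": [("s", ""), ("ses", "s"), ("xes", "x"), ("zes", "z"),
          ("ches", "ch"), ("shes", "sh"), ("men", "man"), ("ies", "y")],
    "v": [("s", ""), ("ies", "y"), ("es", "e"), ("es", ""),
          ("ed", "e"), ("ed", ""), ("ing", "e"), ("ing", "")],
    "a": [("er", ""), ("est", ""), ("er", "e"), ("est", "e")],
}

# Toy data (assumption): a real implementation would load these from WordNet.
LEXICON: Dict[str, Set[str]] = {
    "n": {"dog", "church", "party", "leaf", "goose"},
    "v": {"leave", "make", "carry", "run"},
    "a": {"big", "large"},
}
EXCEPTIONS: Dict[str, Dict[str, str]] = {
    "n": {"geese": "goose", "leaves": "leaf"},
    "v": {"ran": "run"},
    "a": {"better": "good"},
}


def lemmatize(word: str, pos: str) -> str:
    """Return a lemma for `word` given a POS tag in {"n", "v", "a"}."""
    word = word.lower()
    # 1. Irregular forms are resolved from the exception list.
    if word in EXCEPTIONS[pos]:
        return EXCEPTIONS[pos][word]
    # 2. Words that are already base forms are returned as-is.
    if word in LEXICON[pos]:
        return word
    # 3. Try each suffix rule; accept the first candidate the lexicon knows.
    for suffix, replacement in DETACHMENT_RULES[pos]:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if candidate in LEXICON[pos]:
                return candidate
    # 4. Unknown words fall through unchanged.
    return word
```

With this toy data, `lemmatize("leaves", "n")` gives `"leaf"` while `lemmatize("leaves", "v")` gives `"leave"`, which is why a POS tag has to be available before lemmatizing.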

etrain, Apr 22 '16 17:04