
Finishing up Stanford Deprecation

Open tomaarsen opened this issue 2 years ago • 7 comments

Hello!

As some of you might be aware, several Stanford-related classes were deprecated back in 2017, among them StanfordPOSTagger, StanfordNERTagger, StanfordParser, StanfordDependencyParser, StanfordTokenizer and StanfordSegmenter (see the reference table below).

These have been replaced by newer CoreNLP-based classes, chiefly CoreNLPParser and CoreNLPDependencyParser.¹

Note that each of these new classes relies on a running CoreNLP server. One way to get one running is directly from the Java distribution, as mentioned in https://github.com/nltk/nltk/pull/1735#issuecomment-306091826 by the author of most of these changes, @alvations. He used:

```
wget http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip
unzip stanford-corenlp-full-2016-10-31.zip && cd stanford-corenlp-full-2016-10-31

java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
-preload tokenize,ssplit,pos,lemma,parse,depparse \
-status_port 9000 -port 9000 -timeout 15000
```

Note that newer versions of the stanford-corenlp package are available nowadays. Alternatively, the CoreNLPServer class can also be used to run the server in Python, though I haven't gotten that to work on Windows.
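
For reference, once a server is listening on port 9000 (started via either of the routes above), pointing NLTK at it is straightforward; a minimal sketch:

```python
from nltk.parse.corenlp import CoreNLPParser

# Sketch: assumes the server started above is reachable at http://localhost:9000
parser = CoreNLPParser(url="http://localhost:9000")
print(list(parser.tokenize("Good muffins cost $3.88 in New York.")))
```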


What now?

All of these Stanford classes contain DeprecationWarnings placed back in 2017, such as this one: https://github.com/nltk/nltk/blob/d21646dbd547cdd02d0c60f8e23d1d28a9fd1266/nltk/tokenize/stanford_segmenter.py#L71-L82

Clearly, we need to make some changes here. We're on v3.6.3 now.

With this issue I invite some discussion on the following options (among others):

  1. Remove the deprecated classes in their entirety.
  2. Remove the bodies of the methods, and point to documentation describing how to port them to the newer CoreNLP equivalents.
  3. Keep them, but don't maintain them if we have issues in the future.

Personally I'm leaning towards either 1 or 2.

However, before simply removing potentially widely used code, I went over each of the deprecated classes to check that there are indeed newer equivalents, and to collect them for the documentation.


Stanford migration reference

The following tables list the deprecated classes with their main methods, and the equivalent newer classes and methods. Each row pairs old usage with its newer equivalent; *deprecated* marks methods without a direct replacement, and `...` marks new methods without an old counterpart.

POS Tagger

| Old | New |
| --- | --- |
| `from nltk.tag.stanford import StanfordPOSTagger` | `from nltk.parse import CoreNLPParser` |
| `tagger = StanfordPOSTagger()` | `parser = CoreNLPParser(tagtype="pos")` |
| `tagger.tag(...)` | `parser.tag(...)` |
| `tagger.tag_sents(...)` | `parser.tag_sents(...)` |
| `tagger.parse_output(...)` | *deprecated* |
| `...` | `parser.raw_tag_sents(...)` |

NER Tagger

| Old | New |
| --- | --- |
| `from nltk.tag.stanford import StanfordNERTagger` | `from nltk.parse import CoreNLPParser` |
| `tagger = StanfordNERTagger()` | `parser = CoreNLPParser(tagtype="ner")` |
| `tagger.tag(...)` | `parser.tag(...)` |
| `tagger.tag_sents(...)` | `parser.tag_sents(...)` |
| `tagger.parse_output(...)` | *deprecated* |
| `...` | `parser.raw_tag_sents(...)` |

StanfordParser

| Old | New |
| --- | --- |
| `from nltk.parse.stanford import StanfordParser` | `from nltk.parse import CoreNLPParser` |
| `parser = StanfordParser()` | `parser = CoreNLPParser()` |
| `parser.parse_sents(...)` | `parser.parse_sents(...)` |
| `parser.raw_parse(...)` | `parser.raw_parse(...)` |
| `parser.raw_parse_sents(...)` | `parser.raw_parse_sents(...)` |
| `parser.tagged_parse(...)` | *deprecated* |
| `parser.tagged_parse_sents(...)` | *deprecated* |
| `...` | `parser.parse_text(...)` |

StanfordTokenizer

| Old | New |
| --- | --- |
| `from nltk.tokenize.stanford import StanfordTokenizer` | `from nltk.parse import CoreNLPParser` |
| `tokenizer = StanfordTokenizer()` | `parser = CoreNLPParser()` |
| `tokenizer.tokenize(...)` | `parser.tokenize(...)` |
| `tokenizer.tokenize_sents(...)` | `parser.tokenize_sents(...)` |

StanfordSegmenter

| Old | New |
| --- | --- |
| `from nltk.tokenize import StanfordSegmenter` | `from nltk.parse import CoreNLPParser` |
| `segmenter = StanfordSegmenter()` | `parser = CoreNLPParser()` |
| `segmenter.tokenize(...)` | `parser.tokenize(...)` |
| `segmenter.tokenize_sents(...)` | `parser.tokenize_sents(...)` |
| `segmenter.segment_file(...)` | *deprecated* |
| `segmenter.segment(...)` | *deprecated* |
| `segmenter.segment_sents(...)` | *deprecated* |
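
As a concrete example of the right-hand column in practice, POS tagging might look roughly like this (a sketch, assuming a CoreNLP server is already running on the default port 9000):

```python
from nltk.parse import CoreNLPParser

# Sketch: assumes a CoreNLP server is reachable at http://localhost:9000
pos_tagger = CoreNLPParser(url="http://localhost:9000", tagtype="pos")

tokens = ["What", "is", "the", "airspeed", "of", "an", "unladen", "swallow", "?"]
print(pos_tagger.tag(tokens))
# Expected shape: a list of (token, tag) pairs, e.g. ("What", "WP"), ("is", "VBZ"), ...
```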

Notes

  • StanfordDependencyParser used to have the same methods as StanfordParser. Nowadays, you should use CoreNLPDependencyParser instead, which has the same methods as CoreNLPParser.
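
For completeness, a minimal sketch of that replacement (again assuming a CoreNLP server with the depparse annotator on port 9000):

```python
from nltk.parse.corenlp import CoreNLPDependencyParser

# Sketch: assumes a CoreNLP server with the depparse annotator on localhost:9000
dep_parser = CoreNLPDependencyParser(url="http://localhost:9000")
parse, = dep_parser.raw_parse("The quick brown fox jumps over the lazy dog.")
print(parse.to_conll(4))  # the result is a plain nltk DependencyGraph
```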

My goal with this issue is to reach a consensus on how to move forward, and then create a PR with the agreed-upon changes, so feel free to share your opinion.

  • Tom Aarsen

Footnotes

¹ StanfordNeuralDependencyParser was never fully implemented, and as a result has no equivalent among the newer CoreNLP classes.

tomaarsen · Sep 21 '21

Hi, this is an impressive analysis! I'll share some context.

Around the time NLTK's CoreNLP REST bindings (CoreNLPServer and friends) were developed, the Stanford team came up with their own CoreNLP client called Stanza; see especially the client docs. At that time we were recommending Stanza rather than NLTK's CoreNLP client.

However, to my surprise, NLTK's CoreNLP client has been used and has even received occasional PRs, which suggests that there is value in having a CoreNLP client that is deeply integrated with NLTK. In my opinion, it is the integration that could bring real benefit. For example, once you are familiar with NLTK's dependency graph API, it's easy to use CoreNLP to get dependencies.
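
For instance (a rough sketch, assuming a server on localhost:9000), the parse comes back as an ordinary DependencyGraph, so the familiar API applies directly:

```python
from nltk.parse.corenlp import CoreNLPDependencyParser

# Sketch: CoreNLP results are exposed through NLTK's own DependencyGraph API
dep_parser = CoreNLPDependencyParser(url="http://localhost:9000")
parse, = dep_parser.raw_parse("I shot an elephant in my pajamas.")

# The usual DependencyGraph methods work, e.g. (governor, relation, dependent) triples
for governor, relation, dependent in parse.triples():
    print(governor, relation, dependent)
```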

Maybe, instead of having completely custom client code, it would be worth using Stanza to perform the API calls and handle customization. I'm not entirely happy with how customization is currently handled in NLTK's client, e.g. how default properties are handled.

Ideally, Stanza should be used as much as possible, and NLTK's CoreNLP code would be a thin layer on top to provide a unified API and integration with other NLTK modules.
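
To make that concrete, a thin layer along those lines might delegate the server handling and API calls to Stanza's client, roughly like this (a sketch only; corenlp_pos_tag is a hypothetical wrapper, not an existing NLTK API, and it assumes stanza is installed and CORENLP_HOME points at a CoreNLP install):

```python
from stanza.server import CoreNLPClient


def corenlp_pos_tag(sentence):
    """Hypothetical thin wrapper: Stanza starts the server and performs the
    API calls; the NLTK-side code would only reshape the result."""
    with CoreNLPClient(annotators=["tokenize", "ssplit", "pos"], be_quiet=True) as client:
        ann = client.annotate(sentence)
        return [(token.word, token.pos)
                for sent in ann.sentence
                for token in sent.token]


print(corenlp_pos_tag("The quick brown fox jumps over the lazy dog."))
```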

I would also think of extending NLTK's documentation, so there is no need to refer to a PR as documentation. Tests are another big topic, especially after the pytest adoption.

I should be able to find some time to help if needed.

dimazest · Sep 21 '21

Thank you for the context! I've heard of Stanza, but that's just about it. I agree that it seems best to let Stanza handle most of the logic, while we focus on e.g. integrating Stanza with the remainder of NLTK, such as outputting to Tree objects, assuming that is worth the time investment.

Beyond that, I am interested in updating some documentation to add the table above; however, I'm not quite sure where it fits best.

And regarding tests - I was looking into automatically downloading some of the third party tools prior to executing the CI tests. In fact, that is how I rediscovered all of these deprecated classes.

tomaarsen · Sep 21 '21

> And regarding tests - I was looking into automatically downloading some of the third party tools prior to executing the CI tests. In fact, that is how I rediscovered all of these deprecated classes.

That would be super useful.

dimazest · Sep 21 '21

```
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
-preload tokenize,ssplit,pos,lemma,parse,depparse \
-status_port 9000 -port 9000 -timeout 15000
```

With the command above I get the error `Error: Could not find or load main class edu.stanford.nlp.pipeline.StanfordCoreNLPServer`

I executed it in the folder stanford-segmenter-2020-11-17.

How can I launch the StanfordCoreNLPServer?

ehsong · Sep 26 '22

@ehsong the Stanford Segmenter is deprecated. Instead, you can use the CoreNLPParser:

```python
from nltk.parse.corenlp import CoreNLPServer, CoreNLPParser

# This context manager syntax starts the server when the scope is entered,
# and stops the server when the scope is exited
with CoreNLPServer() as server:
    parser = CoreNLPParser(server.url)

    sentence = "This is my sentence, which I'd like to get parsed."
    tokenized = list(parser.tokenize(sentence))
    print(tokenized)
```

which outputs:

['This', 'is', 'my', 'sentence', ',', 'which', 'I', "'d", 'like', 'to', 'get', 'parsed', '.']

Note that this requires a CLASSPATH environment variable to be set to the folder containing stanford-corenlp-X.X.X.jar. You can download the zip with this folder here: https://stanfordnlp.github.io/CoreNLP/download.html

Alternatively, you can use another tokenizer if you wish (e.g. from nltk import word_tokenize).
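
For completeness, the pure-NLTK route (no Java or server needed) would be something like:

```python
import nltk
from nltk import word_tokenize

nltk.download("punkt")  # word_tokenize relies on the punkt tokenizer models
print(word_tokenize("This is my sentence, which I'd like to get parsed."))
```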

Hope that helps.

tomaarsen · Sep 26 '22

@tomaarsen

Hi, thank you so much for the quick response. I set CLASSPATH to where the .jar file is, but it is still throwing errors. Could you help? I did the following:

os.environ['CLASSPATH'] = "/Users/esthersong/Dropbox/ChinaDispute/Figure/Chinese/Analysis_7_StanfordNLP/stanford-corenlp-4.5.1/*"

It still throws the following error:

LookupError: 

===========================================================================
  NLTK was unable to find stanford-corenlp-(\d+)\.(\d+)\.(\d+)\.jar!
  Set the CLASSPATH environment variable.

ehsong · Sep 26 '22

I would recommend adding the environment variable in Windows itself, not in Python. You can google the steps, but here's a short overview:

  • Hit the Windows key to start searching.
  • Type Edit the system environment variables and hit enter.
  • Click on "Environment Variables" towards the bottom of the "System Properties" popup (In the Advanced tab)
  • In User variables for ... or System variables, click on "New"
  • Give CLASSPATH as the variable name, and the full path as the value. This may be /Users/esthersong/Dropbox/ChinaDispute/Figure/Chinese/Analysis_7_StanfordNLP/stanford-corenlp-4.5.1, or it may be prefixed with a drive letter denoting which hard drive you're using (e.g. C:/Users/...).
  • You may have to restart your IDE or whatever you're using to run your code, as it will generally only fetch environment variables on startup.

Good luck!

tomaarsen · Sep 26 '22