Finishing up Stanford Deprecation
Hello!
As some of you might be aware, several Stanford-related classes were deprecated back in 2017. They are the following:
- `nltk.tag.StanfordTagger`
- `nltk.tag.StanfordPOSTagger`
- `nltk.tag.StanfordNERTagger`
- `nltk.parse.GenericStanfordParser`
- `nltk.parse.StanfordParser`
- `nltk.parse.StanfordDependencyParser`
- `nltk.parse.StanfordNeuralDependencyParser`
- `nltk.tokenize.StanfordTokenizer`
- `nltk.tokenize.StanfordSegmenter`
These have been replaced by the following newer classes:¹

- `nltk.parse.corenlp.CoreNLPParser`
- `nltk.parse.corenlp.CoreNLPDependencyParser`
Note that each of these new classes relies on a running `CoreNLPServer`. One way to get this running is directly from the source using Java, as mentioned in https://github.com/nltk/nltk/pull/1735#issuecomment-306091826 by the author of most of these changes, @alvations. He used:
```bash
wget http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip
unzip stanford-corenlp-full-2016-10-31.zip && cd stanford-corenlp-full-2016-10-31

java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
    -preload tokenize,ssplit,pos,lemma,parse,depparse \
    -status_port 9000 -port 9000 -timeout 15000
```
Note that newer versions of the `stanford-corenlp` package are available nowadays.
Alternatively, the `CoreNLPServer` class can also be used to run the server from Python, though I haven't gotten that to work on Windows.
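For reference, a minimal sketch of starting the server from Python; the jar paths below are placeholders and should point at wherever the CoreNLP zip was unpacked:

```python
from nltk.parse.corenlp import CoreNLPServer

# Placeholder paths - adjust to your local CoreNLP installation.
server = CoreNLPServer(
    path_to_jar="stanford-corenlp-full-2016-10-31/stanford-corenlp-3.7.0.jar",
    path_to_models_jar="stanford-corenlp-full-2016-10-31/stanford-corenlp-3.7.0-models.jar",
)
server.start()
print(server.url)  # e.g. http://localhost:9000
server.stop()
```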
What now?
All of these Stanford classes contain DeprecationWarnings placed back in 2017, such as this one: https://github.com/nltk/nltk/blob/d21646dbd547cdd02d0c60f8e23d1d28a9fd1266/nltk/tokenize/stanford_segmenter.py#L71-L82
Clearly, we need to make some changes here. We're on v3.6.3 now.
With this issue I invite some discussion on the following options (among others):

1. Remove the deprecated classes in their entirety.
2. Remove the bodies of the methods, and point to a documentation reference on porting these methods to the newer CoreNLP equivalents (see the sketch below).
3. Keep them, but stop maintaining them if issues come up in the future.

Personally, I'm leaning towards either 1 or 2.
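For option 2, the stripped-down classes could look something like this hypothetical sketch (not existing NLTK code):

```python
class StanfordPOSTagger:
    """Deprecated: see the Stanford migration reference below for the CoreNLP equivalents."""

    def __init__(self, *args, **kwargs):
        raise NotImplementedError(
            "StanfordPOSTagger has been removed. Please use "
            "nltk.parse.corenlp.CoreNLPParser(url=..., tagtype='pos') instead."
        )
```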
However, before simply removing potentially often-used code, I went over each of the deprecated classes to check whether newer equivalents do indeed exist, and to gather material to add to the documentation somewhere.
Stanford migration reference
The following table contains the deprecated classes with their main methods, alongside the equivalent newer classes and methods. Each line in the left column is equivalent to the corresponding line in the right column.
Old | New |
---|---|
**POS Tagger** | |
`nltk.tag.StanfordPOSTagger(...)` | `nltk.parse.corenlp.CoreNLPParser(url=..., tagtype='pos')` |
`.tag(tokens)` | `.tag(tokens)` |
**NER Tagger** | |
`nltk.tag.StanfordNERTagger(...)` | `nltk.parse.corenlp.CoreNLPParser(url=..., tagtype='ner')` |
`.tag(tokens)` | `.tag(tokens)` |
**StanfordParser** | |
`nltk.parse.StanfordParser(...)` | `nltk.parse.corenlp.CoreNLPParser(url=...)` |
`.parse(tokens)` | `.parse(tokens)` |
**StanfordTokenizer** | |
`nltk.tokenize.StanfordTokenizer(...)` | `nltk.parse.corenlp.CoreNLPParser(url=...)` |
`.tokenize(sentence)` | `.tokenize(sentence)` |
**StanfordSegmenter** | |
`nltk.tokenize.StanfordSegmenter(...)` | `nltk.parse.corenlp.CoreNLPParser(url=...)` |
`.segment(tokens)` | `.tokenize(text)` |
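For example, the POS tagger row translates to the following usage, assuming a CoreNLP server is already running at the default `http://localhost:9000`:

```python
from nltk.parse.corenlp import CoreNLPParser

# Assumes a CoreNLP server running at http://localhost:9000.
pos_tagger = CoreNLPParser(url='http://localhost:9000', tagtype='pos')
print(pos_tagger.tag('What is the airspeed of an unladen swallow ?'.split()))
# e.g. [('What', 'WP'), ('is', 'VBZ'), ('the', 'DT'), ...]
```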
Notes

- `StanfordDependencyParser` used to have the same methods as `StanfordParser`. Nowadays, you should use `CoreNLPDependencyParser` instead, which has the same methods as `CoreNLPParser`.
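For illustration, a short sketch of the dependency parser (again assuming a server at the default URL); the result is a regular NLTK `DependencyGraph`:

```python
from nltk.parse.corenlp import CoreNLPDependencyParser

# Same interface as CoreNLPParser; assumes a server at http://localhost:9000.
dep_parser = CoreNLPDependencyParser(url='http://localhost:9000')
parse, = dep_parser.parse('The quick brown fox jumps over the lazy dog'.split())
for governor, dep, dependent in parse.triples():
    print(governor, dep, dependent)
```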
My goal with this issue is to reach a consensus on how to move forward, and then create a PR with those agreed-upon changes, so feel free to share your opinion.
- Tom Aarsen
Footnotes

1: `StanfordNeuralDependencyParser` was never fully implemented, and as a result does not exist in the newer CoreNLP format.
Hi, this is an impressive analysis! I'll share some context.
At the time NLTK's CoreNLP REST bindings (`CoreNLPServer` and friends) were developed, the Stanford team came up with their own CoreNLP client called Stanza; see especially the client docs. At that time we were recommending using Stanza rather than NLTK's CoreNLP client.
However, to my surprise, NLTK's CoreNLP client has been used and has even received occasional PRs, which suggests there is value in having a CoreNLP client that is deeply integrated with NLTK. In my opinion, it is the integration that could bring real benefit. For example, once you are familiar with NLTK's dependency graph API, it's easy to use CoreNLP to get dependencies.
Maybe, instead of having completely custom client code, it would be worth using Stanza to perform the API calls and handle customization. I'm not entirely happy with how customization is currently handled in NLTK's client, e.g. how default properties are handled.
Ideally, Stanza should be used as much as possible, and NLTK's CoreNLP code would be a thin layer on top to provide a unified API and integration with other NLTK modules.
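For illustration, a minimal sketch of what going through Stanza's own client looks like (this uses Stanza's documented `CoreNLPClient` and assumes the `CORENLP_HOME` environment variable points at a CoreNLP installation):

```python
from stanza.server import CoreNLPClient

# Stanza starts and stops the CoreNLP server itself; CORENLP_HOME must be set.
with CoreNLPClient(annotators=['tokenize', 'ssplit', 'pos'], memory='4G') as client:
    ann = client.annotate('This is a test sentence.')
    for sentence in ann.sentence:
        for token in sentence.token:
            print(token.word, token.pos)
```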
I would also think about extending NLTK's documentation, so there is no need to refer to a PR as documentation. Tests are another big topic, especially after the pytest adoption.
I should be able to find some time to help if needed.
Thank you for the context!
I've heard of Stanza, but that's about it. I agree that it seems best to let Stanza handle most of the logic, while we focus on e.g. integrating Stanza with the rest of NLTK, such as outputting to `Tree` objects, assuming that is worth the time investment.
Beyond that, I am interested in updating some documentation to add the table above, though I'm not quite sure where it fits best.
And regarding tests - I was looking into automatically downloading some of the third party tools prior to executing the CI tests. In fact, that is how I rediscovered all of these deprecated classes.
> And regarding tests - I was looking into automatically downloading some of the third party tools prior to executing the CI tests. In fact, that is how I rediscovered all of these deprecated classes.

That would be super useful.
```bash
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
    -preload tokenize,ssplit,pos,lemma,parse,depparse \
    -status_port 9000 -port 9000 -timeout 15000
```
With the code above I get the error `Error: Could not find or load main class edu.stanford.nlp.pipeline.StanfordCoreNLPServer`. I executed the code under the folder `stanford-segmenter-2020-11-17`.
How can I launch the StanfordCoreNLPServer?
@ehsong the Stanford Segmenter is deprecated. Instead, you can use the `CoreNLPParser`:
```python
from nltk.parse.corenlp import CoreNLPServer, CoreNLPParser

# This context manager syntax starts the server when the scope is entered,
# and stops the server when the scope is exited
with CoreNLPServer() as server:
    parser = CoreNLPParser(server.url)

    sentence = "This is my sentence, which I'd like to get parsed."
    tokenized = list(parser.tokenize(sentence))
    print(tokenized)
```
which outputs:

```python
['This', 'is', 'my', 'sentence', ',', 'which', 'I', "'d", 'like', 'to', 'get', 'parsed', '.']
```
Note that this requires a `CLASSPATH` environment variable to be set to the folder containing `stanford-corenlp-X.X.X.jar`. You can download the zip with this folder here: https://stanfordnlp.github.io/CoreNLP/download.html
Alternatively, you can use another tokenizer if you wish (e.g. `from nltk import word_tokenize`).
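For instance, a minimal sketch of that route (it assumes the `punkt` tokenizer models have been downloaded via `nltk.download('punkt')`):

```python
from nltk import word_tokenize

# Assumes the 'punkt' models are installed: nltk.download('punkt')
print(word_tokenize("This is my sentence, which I'd like to get parsed."))
```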
Hope that helps.
@tomaarsen Hi, thank you so much for the quick response. I set the `CLASSPATH` to where the .jar file is, but it is still throwing errors; could you help? I did the following:

```python
os.environ['CLASSPATH'] = "/Users/esthersong/Dropbox/ChinaDispute/Figure/Chinese/Analysis_7_StanfordNLP/stanford-corenlp-4.5.1/*"
```
It still throws the following error:

```
LookupError:
===========================================================================
NLTK was unable to find stanford-corenlp-(\d+)\.(\d+)\.(\d+)\.jar!
Set the CLASSPATH environment variable.
===========================================================================
```
I would recommend adding the environment variable in Windows itself, not in Python. You can google the steps, but here's a short overview:
- Hit the Windows key to start searching.
- Type `Edit the system environment variables` and hit enter.
- Click on "Environment Variables" towards the bottom of the "System Properties" popup (in the Advanced tab).
- In `User variables for ...` or `System variables`, click on "New".
- Give `CLASSPATH` as the variable name, and the full path as the value. This may be `/Users/esthersong/Dropbox/ChinaDispute/Figure/Chinese/Analysis_7_StanfordNLP/stanford-corenlp-4.5.1`, or it may be prefixed with a drive letter denoting which hard drive you're using (e.g. `C:/Users/...`).
- You may have to restart your IDE or whatever you're using to run your code, as it will generally only fetch environment variables on startup.
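After restarting, a quick way to verify that the variable is actually visible to Python:

```python
import os

# Should print the configured CoreNLP folder, not None.
print(os.environ.get('CLASSPATH'))
```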
Good luck!