NullPointerException with AnnotationSerializer (ProtobufAnnotationSerializer) and CoreMapExpressionExtractor
I have a test that I run on CoreMapExpressionExtractor which relies on caching the result of an annotation run so I don't have to run the parser each time. I use the ProtobufAnnotationSerializer for this. However, when I deserialize the cached data, all of the newline tokens have been set to null. The result is that when I run the extractor, a NullPointerException gets thrown at https://github.com/stanfordnlp/CoreNLP/blob/16ac6de8b1d5ecd959170ad78ea965ee5fba89a5/src/edu/stanford/nlp/ling/tokensregex/MatchedExpression.java#L321
Obviously, calling containsKey on a null causes the exception to be thrown.
The workaround I used was simply to iterate over the tokens before using them and insert a default newline CoreLabel element, and it seems to work. Is there a better way to handle that?
My new Star Trek theory is that all Andorians who go into engineering become Aenar:
- pale skin from years of being in computer labs and not getting any sun
- bad vision from reading technical manuals in the dark despite their mothers warning them not to
- psychic abilities to deal with support questions with no code samples
I've never read technical manuals in the dark.
This is the code that either pulls the data from the cache or runs the annotators described above.
protected CoreDocument annotate(String filename) {
    try {
        CoreDocument fromCache = AnnotationCache.fromCache(filename); // <-- This is the next code block
        if (fromCache != null) {
            return fromCache;
        }
        File longerNameForFilename = new File(filename);
        String realpath = longerNameForFilename.getCanonicalPath();
        Annotation annotation = BasicAnnotator.basicAnnotation(ResourceLoader.loadFile(realpath)); // <-- This is the code block after that
        CoreDocument document = new CoreDocument(annotation);
        AnnotationSerializer serializer = new ProtobufAnnotationSerializer();
        serializer.writeCoreDocument(document, new FileOutputStream(AnnotationCache.getCacheFilename(filename)));
        return document;
    } catch (IOException | InvalidConfigurationException | ClassNotFoundException | NoSuchAlgorithmException e) {
        // exceptions are swallowed in this test helper; the method falls through to return null
    }
    return null;
}
Here is the code that actually reads the cache file (if it exists). This is the code that returns the deserialized document that has the nulls in the token list; it's called from the AnnotationCache.fromCache(filename) line in the snippet above.
public static CoreDocument fromCache(String filename) {
    AnnotationSerializer serializer = new ProtobufAnnotationSerializer();
    try {
        File cacheFilename = getCacheFilename(filename);
        if (cacheFilename.canRead()) {
            Pair<CoreDocument, InputStream> read = serializer.readCoreDocument(new FileInputStream(cacheFilename));
            if (read != null) {
                return read.first;
            }
        }
    } catch (NoSuchAlgorithmException | IOException | ClassNotFoundException e) {
        logger.error(e.toString());
    }
    return null;
}
This is the code that does the annotation work; it's called from the BasicAnnotator.basicAnnotation(...) line in the first code snippet:
public static Annotation basicAnnotation(String text) throws IOException, ClassNotFoundException {
    Annotation annotation = new Annotation(text.trim());
    // lazily build the shared annotator list on first use
    if (annotators.isEmpty()) {
        Properties props = new Properties();
        props.setProperty(StanfordCoreNLP.NEWLINE_IS_SENTENCE_BREAK_PROPERTY, "always");
        annotators.add(new TokenizerAnnotator(props));
        annotators.add(new WordsToSentencesAnnotator(props));
        annotators.add(new POSTaggerAnnotator("pos", props));
        annotators.add(new NERCombinerAnnotator(props));
        annotators.add(new EntityMentionsAnnotator());
    }
    annotators.forEach(annotator -> annotator.annotate(annotation));
    return annotation;
}
The abbreviated test code looks something like this (there's a bunch of code that's irrelevant to this problem):
class DDDD {
    private CoreDocument document;
    private CoreMapExpressionExtractor expressionMatcher;

    public DDDD(CoreDocument document) {
        this.document = document;
        Env env = TokenSequencePattern.getNewEnv();
        expressionMatcher = CoreMapExpressionExtractor.createExtractorFromString(
            env, "..."
        );
    }

    private Map<String, String> getOrganizationAliases() {
        HashMap<String, String> aliases = new HashMap<>();
        // This line is what triggers the NullPointerException <---
        List matches = expressionMatcher.extractExpressions(document.annotation());
        return aliases;
    }
}
The test code works fine if I delete the cache file, causing the annotation to run from scratch. But when I run it from the cached file, all of the newline CoreLabel instances on the CoreAnnotations.TokensAnnotation.class annotation are replaced with null. I can make it work by replacing the nulls with new *NL* CoreLabel instances.
Thank you, that makes it much easier.
That does not even keep NLs as tokens for me. What version are you using? Is there something else you do to keep the NLs?
Here is a complete program version of the functions you sent me (mostly). If I run it on a text file foo.txt, it does not have any NL tokens in it, nor does it crash when I reload the annotation.
Is there a reason you chose not to use a StanfordCoreNLP object, btw?
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.io.IOUtils;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.tokensregex.*;
import edu.stanford.nlp.ling.tokensregex.parser.*;
import edu.stanford.nlp.util.Pair;

import java.io.*;
import java.util.*;
import java.util.regex.*;
import java.util.stream.*;

public class foo {
    public static File getCacheFilename(String filename) {
        return new File(filename + ".cached");
    }

    public static Annotation basicAnnotation(String text) throws IOException, ClassNotFoundException {
        Annotation annotation = new Annotation(text.trim());
        List<Annotator> annotators = new ArrayList<>();
        Properties props = new Properties();
        props.setProperty(StanfordCoreNLP.NEWLINE_IS_SENTENCE_BREAK_PROPERTY, "always");
        annotators.add(new TokenizerAnnotator(props));
        // not needed in 4.5.0
        //annotators.add(new WordsToSentencesAnnotator(props));
        annotators.add(new POSTaggerAnnotator("pos", props));
        annotators.add(new NERCombinerAnnotator(props));
        annotators.add(new EntityMentionsAnnotator());
        annotators.forEach(annotator -> annotator.annotate(annotation));
        return annotation;
    }

    public static CoreDocument fromCache(String filename) {
        AnnotationSerializer serializer = new ProtobufAnnotationSerializer();
        try {
            File cacheFilename = getCacheFilename(filename);
            if (cacheFilename.canRead()) {
                Pair<CoreDocument, InputStream> read = serializer.readCoreDocument(new FileInputStream(cacheFilename));
                if (read != null) {
                    return read.first;
                }
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
        return null;
    }

    protected static CoreDocument annotate(String filename) {
        try {
            CoreDocument fromCache = fromCache(filename);
            if (fromCache != null) {
                return fromCache;
            }
            Annotation annotation = basicAnnotation(IOUtils.slurpFileNoExceptions(filename));
            CoreDocument document = new CoreDocument(annotation);
            AnnotationSerializer serializer = new ProtobufAnnotationSerializer();
            serializer.writeCoreDocument(document, new FileOutputStream(getCacheFilename(filename)));
            return document;
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) throws IOException, ParseException, TokenSequenceParseException {
        CoreDocument document = annotate("foo.txt");
        for (CoreLabel word : document.tokens()) {
            System.out.println(word);
        }
        Env env = TokenSequencePattern.getNewEnv();
        env.setDefaultStringMatchFlags(NodePattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        env.setDefaultStringPatternFlags(Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        String rule = "{ ruleType: \"tokens\", pattern: ([{word:\"I\"}] [{word:/like|love/} & {tag:\"VBP\"}] ([{word:\"pizza\"}])), action: Annotate($1, ner, \"FOOD\"), result: \"PIZZA\" }";
        CoreMapExpressionExtractor extractor = CoreMapExpressionExtractor.createExtractorFromString(env, rule);
        extractor.extractExpressions(document.annotation());
    }
}
Well, I know it's lame but this works:
Class tokensAnnotationKey = EnvLookup.getDefaultTokensAnnotationKey(env);
List<CoreLabel> tokens = (List<CoreLabel>) document.annotation().get(tokensAnnotationKey);
int size = tokens.size();
for (int i = 0; i < size; i++) {
    CoreLabel l = tokens.get(i);
    if (l == null) {
        // replace the null with a synthetic *NL* token
        l = new CoreLabel();
        l.setValue("*NL*");
        l.setOriginalText("\r\n");
        l.setBefore("");
        l.setAfter("");
        l.setBeginPosition(i);
        l.setEndPosition(i + 1);
        tokens.set(i, l);
    }
}
The reason I didn't use StanfordCoreNLP was largely because I gained a little more control without losing functionality by calling the annotators directly in this testing scenario. IIRC, I ran into some difficulties configuring the AnnotationPipeline, which I expediently solved, without regard for future development, by calling the annotators directly.
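(For reference, a roughly equivalent setup using a StanfordCoreNLP object would look something like the sketch below. This is not from the thread: the annotator list mirrors the hand-built chain above, with lemma added since, IIRC, the pipeline's ner stage requires it.)
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import java.util.Properties;

public class PipelineSketch {
    public static CoreDocument annotateWithPipeline(String text) {
        Properties props = new Properties();
        // tokenize/ssplit/pos/lemma/ner; ner builds entity mentions by default in 4.x
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");
        props.setProperty("ssplit.newlineIsSentenceBreak", "always");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        CoreDocument document = new CoreDocument(text.trim());
        pipeline.annotate(document);
        return document;
    }
}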
I'm using CoreNLP 4.4.0.
I suppose, but really the sentence splitter should just be removing the newlines for you. I took the code blocks you pasted above and combined them into one program, as shown in my previous response. When I run that program, I don't get any newlines as tokens. This is true both with 4.4.0 and the newly released 4.5.0.
The tokenizer is improved in 4.5.0, though, and gets the Star Trek names I was using as a test case correct, so I recommend switching regardless.
If you see something I did differently in foo.java from what you're doing to produce newlines, LMK and I'll take a look at any updates you have.
Any further progress on this issue? I don't think there should be newline tokens at all with the code you sent us, and I certainly don't think it should turn those into null. A modification to foo.java that causes the error to happen would be especially useful.
I've not had a chance to go into it, but it's still on my TODO list.
When upgrading to 4.5 a number of my tests started breaking, and I'll need to fix those before being certain, but it looks like the NullPointerException is now gone.
If it's working now, that's great, but I will say that the foo.java program from above works on both 4.4.0 and 4.5.0 for me.
Ahh, I got it to replicate using the following text for foo.txt:
I love pizza.
I like pizza.
I REALLY like pizza so much that I love pizza.
But only like a friend.
When I run it I get:
[main] INFO edu.stanford.nlp.ling.tokensregex.types.Expressions - Unknown variable: ner
[main] INFO edu.stanford.nlp.ling.tokensregex.types.Expressions - Unknown variable: ner
[main] INFO edu.stanford.nlp.ling.tokensregex.types.Expressions - Unknown variable: ner
Exception in thread "main" java.lang.NullPointerException: Cannot invoke "edu.stanford.nlp.util.CoreMap.containsKey(java.lang.Class)" because "cm" is null
at edu.stanford.nlp.ling.tokensregex.MatchedExpression.replaceMergedUsingTokenOffsets(MatchedExpression.java:321)
at edu.stanford.nlp.ling.tokensregex.CoreMapExpressionExtractor.extractExpressions(CoreMapExpressionExtractor.java:493)
at foo.main(foo.java:87)
If I replace the fromCache method with this:
public static CoreDocument fromCache(String filename) {
    AnnotationSerializer serializer = new ProtobufAnnotationSerializer();
    try {
        File cacheFilename = getCacheFilename(filename);
        // cacheFilename.delete();
        if (cacheFilename.canRead()) {
            Pair<CoreDocument, InputStream> read = serializer.readCoreDocument(new FileInputStream(cacheFilename));
            if (read != null) {
                List<CoreLabel> tokens = (List<CoreLabel>) read.first.annotation().get(CoreAnnotations.TokensAnnotation.class);
                int size = tokens.size();
                for (int i = 0; i < size; i++) {
                    CoreLabel l = tokens.get(i);
                    if (l == null) {
                        l = new CoreLabel();
                        l.setValue("*NL*");
                        l.setOriginalText("\r\n");
                        l.setBefore("");
                        l.setAfter("");
                        l.setBeginPosition(i);
                        l.setEndPosition(i + 1);
                        tokens.set(i, l);
                    }
                }
                return read.first;
            }
        }
    } catch (IOException | ClassNotFoundException e) {
        throw new RuntimeException(e);
    }
    return null;
}
the exception goes away.
Would you send me the exact text file you are using? What OS, Java version, or other circumstances might I need to know about? When I copy and paste that into foo.txt, then run it a couple of times, I don't get any kind of exception.
Is there some weird possibility of having different versions of CoreNLP in your classpath?
Oh! I figured it out. The text file was using Windows EOLs. When I switched to Unix LF the problem went away. When I switched back to \r\n the problem returned.
That's fascinating. Thank you for spotting the difference.
What's weird is I would absolutely expect \r\n to be detected as a newline as well. And they do get turned into *NL* in the tokenizer, but they're only removed in the sentence splitter (without splitting sentences) in the case of \n.
Alright, I isolated this to the following block of code in the TokenizerAnnotator:
/**
 * set isNewline()
 */
private static void setNewlineStatus(List<CoreLabel> tokensList) {
    // label newlines
    for (CoreLabel token : tokensList) {
        // NOTE: a Windows "\r\n" newline token spans two characters, so this
        // length-1 check leaves IsNewlineAnnotation set to false for it
        if (token.word().equals(AbstractTokenizer.NEWLINE_TOKEN) && (token.endPosition() - token.beginPosition() == 1))
            token.set(CoreAnnotations.IsNewlineAnnotation.class, true);
        else
            token.set(CoreAnnotations.IsNewlineAnnotation.class, false);
    }
}
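(Presumably the fix is to also accept two-character newline tokens; the sketch below is a guess at the shape of that change, not necessarily the patch that actually shipped.)
private static void setNewlineStatus(List<CoreLabel> tokensList) {
    for (CoreLabel token : tokensList) {
        int width = token.endPosition() - token.beginPosition();
        // treat both "\n" (width 1) and "\r\n" (width 2) as newlines
        boolean isNewline = token.word().equals(AbstractTokenizer.NEWLINE_TOKEN)
                && (width == 1 || width == 2);
        token.set(CoreAnnotations.IsNewlineAnnotation.class, isNewline);
    }
}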
How soon do you need a fix, or is the knowledge that it's the Windows newlines causing the problem enough to avoid future problems?
Since I only use this to cache annotations to avoid reprocessing during tests, and I have a workaround, I'm in no immediate need.
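(Another interim workaround, assuming normalizing line endings is acceptable for the input, is to strip carriage returns before annotating; a minimal sketch:)
// normalize Windows "\r\n" and bare "\r" line endings to "\n" before annotating
String normalized = text.replace("\r\n", "\n").replace("\r", "\n");
Annotation annotation = basicAnnotation(normalized);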
FWIW, there is a preview of what I believe is a fix at https://nlp.stanford.edu/software/stanford-corenlp-4.5.0b.zip (a link which will probably expire whenever an actual bugfix release gets published).
Thanks. I'll try to give it a shot if I get some time.