NullPointerException with AnnotationSerializer (ProtobufAnnotationSerializer) and CoreMapExpressionExtractor
I have a test that I run on CoreMapExpressionExtractor which relies on caching the result of an annotation run so I don't have to run the parser each time. I use the ProtobufAnnotationSerializer for this. However, when I deserialize the cached data, all of the newline tokens have been set to null. The result is that when I run the extractor, a NullPointerException gets thrown at https://github.com/stanfordnlp/CoreNLP/blob/16ac6de8b1d5ecd959170ad78ea965ee5fba89a5/src/edu/stanford/nlp/ling/tokensregex/MatchedExpression.java#L321
Obviously, calling containsKey on a null causes the exception to be thrown.
The workaround I used was simply to iterate over the tokens before using them and insert a default newline CoreLabel element, and it seems to work. Is there a better way to handle that?
My new Star Trek theory is that all Andorians who go into engineering become Aenar:
- pale skin from years of being in computer labs and not getting any sun
- bad vision from reading technical manuals in the dark despite their mothers warning them not to
- psychic abilities to deal with support questions with no code samples
I've never read technical manuals in the dark.
This is the code that either pulls the data from the cache or runs the annotators described above.
protected CoreDocument annotate(String filename) {
    try {
        CoreDocument fromCache = AnnotationCache.fromCache(filename); // <-- This is the next code block
        if (fromCache != null) {
            return fromCache;
        }
        File longerNameForFilename = new File(filename);
        String realpath = longerNameForFilename.getCanonicalPath();
        Annotation annotation = BasicAnnotator.basicAnnotation(ResourceLoader.loadFile(realpath)); // <-- This is the code block after that
        CoreDocument document = new CoreDocument(annotation);
        AnnotationSerializer serializer = new ProtobufAnnotationSerializer();
        serializer.writeCoreDocument(document, new FileOutputStream(AnnotationCache.getCacheFilename(filename)));
        return document;
    } catch (IOException | InvalidConfigurationException | ClassNotFoundException | NoSuchAlgorithmException e) {
        // exceptions are swallowed in this test helper; the method falls through to return null
    }
    return null;
}
Here is the code that actually reads the cache file (if it exists). This is the code that returns the deserialized document that has the nulls in the token list; it's called from the AnnotationCache.fromCache(filename) line in the snippet above.
public static CoreDocument fromCache(String filename) {
    AnnotationSerializer serializer = new ProtobufAnnotationSerializer();
    try {
        File cacheFilename = getCacheFilename(filename);
        if (cacheFilename.canRead()) {
            Pair<CoreDocument, InputStream> read = serializer.readCoreDocument(new FileInputStream(cacheFilename));
            if (read != null) {
                return read.first;
            }
        }
    } catch (NoSuchAlgorithmException | IOException | ClassNotFoundException e) {
        logger.error(e.toString());
    }
    return null;
}
This is the code that does the annotation work; it's called from the BasicAnnotator.basicAnnotation(...) line in the first code snippet:
public static Annotation basicAnnotation(String text) throws IOException, ClassNotFoundException {
    Annotation annotation = new Annotation(text.trim());
    // lazily build the shared annotator list on first use
    if (annotators.isEmpty()) {
        Properties props = new Properties();
        props.setProperty(StanfordCoreNLP.NEWLINE_IS_SENTENCE_BREAK_PROPERTY, "always");
        annotators.add(new TokenizerAnnotator(props));
        annotators.add(new WordsToSentencesAnnotator(props));
        annotators.add(new POSTaggerAnnotator("pos", props));
        annotators.add(new NERCombinerAnnotator(props));
        annotators.add(new EntityMentionsAnnotator());
    }
    annotators.forEach(annotator -> annotator.annotate(annotation));
    return annotation;
}
The abbreviated test code looks something like this (there's a bunch of code that's irrelevant to this problem):
class DDDD {
    private CoreDocument document;
    private CoreMapExpressionExtractor expressionMatcher;

    public DDDD(CoreDocument document) {
        this.document = document;
        Env env = TokenSequencePattern.getNewEnv();
        expressionMatcher = CoreMapExpressionExtractor.createExtractorFromString(
            env, "..."
        );
    }

    private Map<String, String> getOrganizationAliases() {
        HashMap<String, String> aliases = new HashMap<>();
        // This line is what triggers the NullPointerException <---
        List matches = expressionMatcher.extractExpressions(document.annotation());
        return aliases;
    }
}
The test code works fine if I delete the cache file, causing the annotation to run from scratch. But when I run it from the cached file, all of the newline CoreLabel instances on the CoreAnnotations.TokensAnnotation.class annotation are replaced with null. I can make it work by replacing the nulls with new *NL* CoreLabel instances.
Thank you, that makes it much easier.
That does not even keep NLs as tokens for me. What version are you using? Is there something else you do to keep the NLs?
Here is a complete program version of the functions you sent me (mostly). If I run it on a text file foo.txt, it does not have any NL tokens in it, nor does it crash when I reload the annotation.
Is there a reason you chose not to use a StanfordCoreNLP object, btw?
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.io.IOUtils;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.tokensregex.*;
import edu.stanford.nlp.ling.tokensregex.parser.*;
import edu.stanford.nlp.util.Pair;

import java.io.*;
import java.util.*;
import java.util.regex.*;
import java.util.stream.*;

public class foo {
    public static File getCacheFilename(String filename) {
        return new File(filename + ".cached");
    }

    public static Annotation basicAnnotation(String text) throws IOException, ClassNotFoundException {
        Annotation annotation = new Annotation(text.trim());
        List<Annotator> annotators = new ArrayList<>();
        Properties props = new Properties();
        props.setProperty(StanfordCoreNLP.NEWLINE_IS_SENTENCE_BREAK_PROPERTY, "always");
        annotators.add(new TokenizerAnnotator(props));
        // not needed in 4.5.0
        //annotators.add(new WordsToSentencesAnnotator(props));
        annotators.add(new POSTaggerAnnotator("pos", props));
        annotators.add(new NERCombinerAnnotator(props));
        annotators.add(new EntityMentionsAnnotator());
        annotators.forEach(annotator -> annotator.annotate(annotation));
        return annotation;
    }

    public static CoreDocument fromCache(String filename) {
        AnnotationSerializer serializer = new ProtobufAnnotationSerializer();
        try {
            File cacheFilename = getCacheFilename(filename);
            if (cacheFilename.canRead()) {
                Pair<CoreDocument, InputStream> read = serializer.readCoreDocument(new FileInputStream(cacheFilename));
                if (read != null) {
                    return read.first;
                }
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
        return null;
    }

    protected static CoreDocument annotate(String filename) {
        try {
            CoreDocument fromCache = fromCache(filename);
            if (fromCache != null) {
                return fromCache;
            }
            Annotation annotation = basicAnnotation(IOUtils.slurpFileNoExceptions(filename));
            CoreDocument document = new CoreDocument(annotation);
            AnnotationSerializer serializer = new ProtobufAnnotationSerializer();
            serializer.writeCoreDocument(document, new FileOutputStream(getCacheFilename(filename)));
            return document;
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) throws IOException, ParseException, TokenSequenceParseException {
        CoreDocument document = annotate("foo.txt");
        for (CoreLabel word : document.tokens()) {
            System.out.println(word);
        }
        Env env = TokenSequencePattern.getNewEnv();
        env.setDefaultStringMatchFlags(NodePattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        env.setDefaultStringPatternFlags(Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        String rule = "{ ruleType: \"tokens\", pattern: ([{word:\"I\"}] [{word:/like|love/} & {tag:\"VBP\"}] ([{word:\"pizza\"}])), action: Annotate($1, ner, \"FOOD\"), result: \"PIZZA\" }";
        CoreMapExpressionExtractor extractor = CoreMapExpressionExtractor.createExtractorFromString(env, rule);
        extractor.extractExpressions(document.annotation());
    }
}
Well, I know it's lame but this works:
Class tokensAnnotationKey = EnvLookup.getDefaultTokensAnnotationKey(env);
List<CoreLabel> tokens = (List<CoreLabel>) document.annotation().get(tokensAnnotationKey);
int size = tokens.size();
for (int i = 0; i < size; i++) {
    CoreLabel l = tokens.get(i);
    if (l == null) {
        // replace the null with a synthetic *NL* token
        l = new CoreLabel();
        l.setValue("*NL*");
        l.setOriginalText("\r\n");
        l.setBefore("");
        l.setAfter("");
        l.setBeginPosition(i);
        l.setEndPosition(i + 1);
        tokens.set(i, l);
    }
}
The reason I didn't use StanfordCoreNLP was largely because I gained a little more control without losing functionality by calling the annotators directly in this testing scenario. IIRC, I ran into some difficulties configuring the AnnotationPipeline, which I expediently solved, without regard for future development, by calling the annotators directly.
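(For reference, a roughly equivalent setup using a StanfordCoreNLP object would look something like the sketch below. This is not from the thread: the annotator list mirrors the hand-built chain above, with lemma added since, IIRC, the pipeline's ner stage requires it.)
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import java.util.Properties;

public class PipelineSketch {
    public static CoreDocument annotateWithPipeline(String text) {
        Properties props = new Properties();
        // tokenize/ssplit/pos/lemma/ner; ner builds entity mentions by default in 4.x
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");
        props.setProperty("ssplit.newlineIsSentenceBreak", "always");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        CoreDocument document = new CoreDocument(text.trim());
        pipeline.annotate(document);
        return document;
    }
}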
I'm using CoreNLP 4.4.0.
I suppose, but really the sentence splitter should just be removing the newlines for you. I took the code blocks you pasted above and combined them into one program, as shown in my previous response. When I run that program, I don't get any newlines as tokens. This is true both with 4.4.0 and the newly released 4.5.0.
The tokenizer is improved in 4.5.0, though, and gets the Star Trek names I was using as a test case correct, so I recommend switching regardless.
If you see something I did differently in foo.java from what you're doing to produce newlines, LMK and I'll take a look at any updates you have.
Any further progress on this issue? I don't think there should be newline tokens at all with the code you sent us, and I certainly don't think it should turn those into null. A modification to foo.java that causes the error to happen would be especially useful.
I've not had a chance to go into it, but it's still on my TODO list.
When upgrading to 4.5 a number of my tests started breaking, and I'll need to fix those before being certain, but it looks like the NullPointerException is now gone.
If it's working now, that's great, but I will say that the foo.java program from above works on both 4.4.0 and 4.5.0 for me.
Ahh, I got it to replicate using the following text for foo.txt:
I love pizza.
I like pizza.
I REALLY like pizza so much that I love pizza.
But only like a friend.
When I run it I get:
[main] INFO edu.stanford.nlp.ling.tokensregex.types.Expressions - Unknown variable: ner
[main] INFO edu.stanford.nlp.ling.tokensregex.types.Expressions - Unknown variable: ner
[main] INFO edu.stanford.nlp.ling.tokensregex.types.Expressions - Unknown variable: ner
Exception in thread "main" java.lang.NullPointerException: Cannot invoke "edu.stanford.nlp.util.CoreMap.containsKey(java.lang.Class)" because "cm" is null
at edu.stanford.nlp.ling.tokensregex.MatchedExpression.replaceMergedUsingTokenOffsets(MatchedExpression.java:321)
at edu.stanford.nlp.ling.tokensregex.CoreMapExpressionExtractor.extractExpressions(CoreMapExpressionExtractor.java:493)
at foo.main(foo.java:87)
If I replace the fromCache method with this:
public static CoreDocument fromCache(String filename) {
    AnnotationSerializer serializer = new ProtobufAnnotationSerializer();
    try {
        File cacheFilename = getCacheFilename(filename);
        // cacheFilename.delete();
        if (cacheFilename.canRead()) {
            Pair<CoreDocument, InputStream> read = serializer.readCoreDocument(new FileInputStream(cacheFilename));
            if (read != null) {
                List<CoreLabel> tokens = (List<CoreLabel>) read.first.annotation().get(CoreAnnotations.TokensAnnotation.class);
                int size = tokens.size();
                for (int i = 0; i < size; i++) {
                    CoreLabel l = tokens.get(i);
                    if (l == null) {
                        l = new CoreLabel();
                        l.setValue("*NL*");
                        l.setOriginalText("\r\n");
                        l.setBefore("");
                        l.setAfter("");
                        l.setBeginPosition(i);
                        l.setEndPosition(i + 1);
                        tokens.set(i, l);
                    }
                }
                return read.first;
            }
        }
    } catch (IOException | ClassNotFoundException e) {
        throw new RuntimeException(e);
    }
    return null;
}
the exception goes away.
Would you send me the exact text file you are using? What OS, Java version, or other circumstances might I need to know about? When I copy and paste that into foo.txt, then run it a couple of times, I don't get any kind of exception.
Is there some weird possibility of having different versions of CoreNLP in your classpath?
Oh! I figured it out. The text file was using Windows EOLs. When I switched to Unix LF the problem went away. When I switched back to \r\n the problem returned.
That's fascinating. Thank you for spotting the difference.
What's weird is I would absolutely expect \r\n to be detected as a newline as well. And they do get turned into *NL* in the tokenizer, but they're only removed in the sentence splitter (without splitting sentences) in the case of \n.
Alright, I isolated this to the following block of code in the TokenizerAnnotator:
/**
 * set isNewline()
 */
private static void setNewlineStatus(List<CoreLabel> tokensList) {
    // label newlines
    for (CoreLabel token : tokensList) {
        // NOTE: a Windows "\r\n" newline token spans two characters, so this
        // length-1 check leaves IsNewlineAnnotation set to false for it
        if (token.word().equals(AbstractTokenizer.NEWLINE_TOKEN) && (token.endPosition() - token.beginPosition() == 1))
            token.set(CoreAnnotations.IsNewlineAnnotation.class, true);
        else
            token.set(CoreAnnotations.IsNewlineAnnotation.class, false);
    }
}
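(Presumably the fix is to also accept two-character newline tokens; the sketch below is a guess at the shape of that change, not necessarily the patch that actually shipped.)
private static void setNewlineStatus(List<CoreLabel> tokensList) {
    for (CoreLabel token : tokensList) {
        int width = token.endPosition() - token.beginPosition();
        // treat both "\n" (width 1) and "\r\n" (width 2) as newlines
        boolean isNewline = token.word().equals(AbstractTokenizer.NEWLINE_TOKEN)
                && (width == 1 || width == 2);
        token.set(CoreAnnotations.IsNewlineAnnotation.class, isNewline);
    }
}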
How soon do you need a fix, or is the knowledge that it's the Windows newlines causing the problem enough to avoid future problems?
Since I only use this to cache annotations to avoid reprocessing during tests, and I have a workaround, I'm in no immediate need.
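(Another interim workaround, assuming normalizing line endings is acceptable for the input, is to strip carriage returns before annotating; a minimal sketch:)
// normalize Windows "\r\n" and bare "\r" line endings to "\n" before annotating
String normalized = text.replace("\r\n", "\n").replace("\r", "\n");
Annotation annotation = basicAnnotation(normalized);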
FWIW, there is a preview of what I believe is a fix at https://nlp.stanford.edu/software/stanford-corenlp-4.5.0b.zip (a link which will probably expire whenever an actual bugfix release gets published).
Thanks. I'll try to give it a shot if I get some time.