CoreNLP
Tokenizer splitHyphenated regression
The following snippet of code seems to correctly split on the hyphen in "year-end" in 3.9.2, but no longer in 4.4.0. Is this expected behavior?
public static void main(String[] args) {
    String text = "year-end";
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit");
    props.setProperty("tokenize.language", "en");
    props.setProperty("tokenize.options", "splitHyphenated=true,invertible,ptb3Escaping=true");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    Annotation ann = new Annotation(text);
    pipeline.annotate(ann);
    List<CoreLabel> tokens = ann.get(CoreAnnotations.TokensAnnotation.class);
    System.out.println(tokens.stream().map(CoreLabel::originalText).collect(Collectors.toList()));
}
Old output: [year, -, end]
New output: [year-end]
My man, I do not see any issue here:
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import java.util.*;
import java.util.stream.*;

public class foo {
    public static void main(String[] args) {
        String text = "year-end";
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit");
        props.setProperty("tokenize.language", "en");
        props.setProperty("tokenize.options", "splitHyphenated=true,invertible,ptb3Escaping=true");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        Annotation ann = new Annotation(text);
        pipeline.annotate(ann);
        List<CoreLabel> tokens = ann.get(CoreAnnotations.TokensAnnotation.class);
        System.out.println(tokens.stream().map(CoreLabel::originalText).collect(Collectors.toList()));
    }
}
java foo
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[year, -, end]
That is what happens whether I use v4.4.0 (via git checkout v4.4.0) or v4.5.0 in my git clone.
Well, that's strange. Maybe some library interference? I've tried isolating the error as best I can, and still get it:
# lib/main has all of our classpath entries
$ find lib/main -name "*.jar" | grep stanford
lib/main/edu.stanford.nlp_stanford-corenlp_4.4.0.jar
$ unzip -p lib/main/edu.stanford.nlp_stanford-corenlp_4.4.0.jar META-INF/MANIFEST.MF
Manifest-Version: 1.0
Implementation-Version: 4.4.0
Built-Date: 2022-01-20
Created-By: Stanford JavaNLP (jebolton)
Main-class: edu.stanford.nlp.pipeline.StanfordCoreNLP
$ cat foo.java
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import java.util.*;
import java.util.stream.*;

public class foo {
    public static void main(String[] args) {
        String text = "year-end";
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit");
        props.setProperty("tokenize.language", "en");
        props.setProperty("tokenize.options", "splitHyphenated=true,invertible,ptb3Escaping=true");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        Annotation ann = new Annotation(text);
        pipeline.annotate(ann);
        List<CoreLabel> tokens = ann.get(CoreAnnotations.TokensAnnotation.class);
        System.out.println(tokens.stream().map(CoreLabel::originalText).collect(Collectors.toList()));
    }
}
$ "$JAVA_HOME/bin/javac" foo.java
OpenJDK 64-Bit Server VM warning: .hotspot_compiler file is present but has been ignored. Run with -XX:CompileCommandFile=.hotspot_compiler to load the file.
$ "$JAVA_HOME/bin/java" foo
OpenJDK 64-Bit Server VM warning: .hotspot_compiler file is present but has been ignored. Run with -XX:CompileCommandFile=.hotspot_compiler to load the file.
SLF4J: No SLF4J providers were found.
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#noProviders for further details.
[year-end]
Maybe it's an Antlr version issue? We have Antlr Runtime 4.7.2
One step closer: apparently if I remove ptb3Escaping=true from the options then it works as expected. Gonna dig into the Lexer more, but it looks like ptb3Escaping has its own opinions about hyphenation, and there's some ordering indeterminacy around whose opinions matter more.
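Based on that observation, a minimal workaround for anyone hitting this would be to drop ptb3Escaping from the options string entirely (at the cost of losing the normalizations it bundles). Just a sketch of the changed line, not an official fix:

```java
// Workaround sketch: omit ptb3Escaping so it cannot clobber splitHyphenated.
// Note this also disables the PTB normalizations (quotes, parentheses, etc.)
// that ptb3Escaping would otherwise turn on.
props.setProperty("tokenize.options", "splitHyphenated=true,invertible");
```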
Blarg, the decompiler is barfing on PTBLexer and not letting me set breakpoints, but I have pretty good evidence that this is indeed the case.
Consider the block of code in the PTBLexer.flex constructor starting here:
Properties prop = StringUtils.stringToProperties(options);
Set<Map.Entry<Object,Object>> props = prop.entrySet();
for (Map.Entry<Object,Object> item : props) {
    String key = (String) item.getKey();
    String value = (String) item.getValue();
    boolean val = Boolean.parseBoolean(value);
    if ("".equals(key)) {
        // allow an empty item
        //...
    } else if ("ptb3Escaping".equals(key)) {
        //...
        splitHyphenated = ! val;
        //...
    } else if ("ud".equals(key)) {
        //...
        splitHyphenated = val;
        //...
    } else if ("splitHyphenated".equals(key)) {
        splitHyphenated = val;
    }
If I inspect props (fortunately, StringUtils still decompiles) via props.entrySet().iterator().next(), I get splitHyphenated -> true, which suggests that ptb3Escaping comes later in the entry set and thus overwrites the splitHyphenated value.
Are ptb3Escaping and splitHyphenated truly incompatible or is this accidental?
Well, this might wind up being horrible. I tried a couple of different Java 8 installs and got the desired behavior on both, but with a Java 11 and a Java 14 install I got the same error you did. What Java version are you running?
Maybe the string hash function changed between versions, so the keys are iterated in a different order? I guess the simplest fix in that case would be to process the options in a deterministic order, with later keys overriding earlier ones.
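To illustrate the pitfall and that suggested fix: Properties extends Hashtable, whose iteration order is unspecified and generally unrelated to the order the user wrote the options in; parsing the options string into an insertion-ordered map makes "later options win" deterministic. This is only a sketch (the OptionOrder class and parseOrdered helper are made up here for illustration, not CoreNLP code):

```java
import java.util.*;

public class OptionOrder {
    // Parse "key=value,key2,..." into a map that preserves the order
    // in which the options were written; bare keys default to "true".
    static LinkedHashMap<String, String> parseOrdered(String options) {
        LinkedHashMap<String, String> map = new LinkedHashMap<>();
        for (String item : options.split(",")) {
            String[] kv = item.split("=", 2);
            map.put(kv[0], kv.length > 1 ? kv[1] : "true");
        }
        return map;
    }

    public static void main(String[] args) {
        String options = "splitHyphenated=true,invertible,ptb3Escaping=true";

        // Properties is a Hashtable, so entrySet() iteration order is
        // unspecified and need not match the order the user wrote.
        Properties prop = new Properties();
        prop.putAll(parseOrdered(options));
        System.out.println("Hashtable order: " + prop.keySet());

        // Iterating the parsed options in insertion order means a later
        // option deterministically overrides an earlier one, on any JVM.
        System.out.println("Insertion order: " + parseOrdered(options).keySet());
    }
}
```

Processing the entries of the LinkedHashMap instead of the raw Properties entry set would make the ptb3Escaping/splitHyphenated interaction depend only on the order the user specified, not on the JVM's hashing.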
I am now certain that it is the key order in the Properties object causing this problem.
While we come up with some sort of fix, in the meantime, you could always set the splitHyphenated property of the Lexer to whatever value you need...
So, to what extent is this an issue where you would need a quick fix, versus being able to work around it (such as by setting the appropriate option in the Lexer after creating it) until the next release is made?
The fix for the tokenizer is now in dev branch. I would like to fix this in the Parser as well, but that requires serializing all the models again. Please leave this open in the meantime!