CoreNLP
Tokenizer splitHyphenated regression
The following snippet of code seems to correctly split on the hyphen in "year-end" in 3.9.2, but no longer in 4.4.0. Is this expected behavior?
public static void main(String[] args) {
    String text = "year-end";
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit");
    props.setProperty("tokenize.language", "en");
    props.setProperty("tokenize.options", "splitHyphenated=true,invertible,ptb3Escaping=true");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    Annotation ann = new Annotation(text);
    pipeline.annotate(ann);
    List<CoreLabel> tokens = ann.get(CoreAnnotations.TokensAnnotation.class);
    System.out.println(tokens.stream().map(CoreLabel::originalText).collect(Collectors.toList()));
}
Old output: [year, -, end]
New output: [year-end]
My man, I do not see any issue here:
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import java.util.*;
import java.util.stream.*;

public class foo {
    public static void main(String[] args) {
        String text = "year-end";
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit");
        props.setProperty("tokenize.language", "en");
        props.setProperty("tokenize.options", "splitHyphenated=true,invertible,ptb3Escaping=true");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        Annotation ann = new Annotation(text);
        pipeline.annotate(ann);
        List<CoreLabel> tokens = ann.get(CoreAnnotations.TokensAnnotation.class);
        System.out.println(tokens.stream().map(CoreLabel::originalText).collect(Collectors.toList()));
    }
}
java foo
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[year, -, end]
That is what happens whether I use v4.4.0 (via git checkout v4.4.0) or v4.5.0 in my git clone.
Well, that's strange. Maybe some library interference? I've tried isolating the error as best I can, and still get it:
# lib/main has all of our classpath entries
$ find lib/main -name "*.jar" | grep stanford
lib/main/edu.stanford.nlp_stanford-corenlp_4.4.0.jar
$ unzip -p lib/main/edu.stanford.nlp_stanford-corenlp_4.4.0.jar META-INF/MANIFEST.MF
Manifest-Version: 1.0
Implementation-Version: 4.4.0
Built-Date: 2022-01-20
Created-By: Stanford JavaNLP (jebolton)
Main-class: edu.stanford.nlp.pipeline.StanfordCoreNLP
$ cat foo.java
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import java.util.*;
import java.util.stream.*;

public class foo {
    public static void main(String[] args) {
        String text = "year-end";
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit");
        props.setProperty("tokenize.language", "en");
        props.setProperty("tokenize.options", "splitHyphenated=true,invertible,ptb3Escaping=true");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        Annotation ann = new Annotation(text);
        pipeline.annotate(ann);
        List<CoreLabel> tokens = ann.get(CoreAnnotations.TokensAnnotation.class);
        System.out.println(tokens.stream().map(CoreLabel::originalText).collect(Collectors.toList()));
    }
}
$ "$JAVA_HOME/bin/javac" foo.java
OpenJDK 64-Bit Server VM warning: .hotspot_compiler file is present but has been ignored. Run with -XX:CompileCommandFile=.hotspot_compiler to load the file.
$ "$JAVA_HOME/bin/java" foo
OpenJDK 64-Bit Server VM warning: .hotspot_compiler file is present but has been ignored. Run with -XX:CompileCommandFile=.hotspot_compiler to load the file.
SLF4J: No SLF4J providers were found.
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#noProviders for further details.
[year-end]
Maybe it's an Antlr version issue? We have Antlr Runtime 4.7.2
One step closer: apparently if I remove ptb3Escaping=true from the options then it works as expected. Gonna dig into the Lexer more, but it looks like ptb3Escaping has its own opinions about hyphenation, and there's some ordering indeterminacy around whose opinions matter more.
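Based on that observation, a minimal workaround for anyone hitting this would be to drop ptb3Escaping from the options string entirely (at the cost of losing the normalizations it bundles). Just a sketch of the changed line, not an official fix:

```java
// Workaround sketch: omit ptb3Escaping so it cannot clobber splitHyphenated.
// Note this also disables the PTB normalizations (quotes, parentheses, etc.)
// that ptb3Escaping would otherwise turn on.
props.setProperty("tokenize.options", "splitHyphenated=true,invertible");
```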
Blarg, the decompiler is barfing on PTBLexer and not letting me set breakpoints, but I have pretty good evidence that this is indeed the case.
Consider the block of code in the PTBLexer.flex constructor starting here:
Properties prop = StringUtils.stringToProperties(options);
Set<Map.Entry<Object,Object>> props = prop.entrySet();
for (Map.Entry<Object,Object> item : props) {
    String key = (String) item.getKey();
    String value = (String) item.getValue();
    boolean val = Boolean.parseBoolean(value);
    if ("".equals(key)) {
        // allow an empty item
        //...
    } else if ("ptb3Escaping".equals(key)) {
        //...
        splitHyphenated = ! val;
        //...
    } else if ("ud".equals(key)) {
        //...
        splitHyphenated = val;
        //...
    } else if ("splitHyphenated".equals(key)) {
        splitHyphenated = val;
    }
If I inspect props (fortunately, StringUtils still decompiles) via props.entrySet().iterator().next(), I get splitHyphenated -> true, which suggests that ptb3Escaping comes later in the entry set and thus overwrites the splitHyphenated value.
Are ptb3Escaping and splitHyphenated truly incompatible or is this accidental?
Well, this might wind up being horrible. I tried a couple of different Java 8 installs and got the desired behavior on both, but with a Java 11 and a Java 14 install I got the same error you did. What Java version are you running?
Maybe the string hash function changed between versions, so the keys are iterated in a different order? I guess the simplest fix in that case would be to process the options in a deterministic order, with later keys overriding earlier ones.
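To illustrate the pitfall and that suggested fix: Properties extends Hashtable, whose iteration order is unspecified and generally unrelated to the order the user wrote the options in; parsing the options string into an insertion-ordered map makes "later options win" deterministic. This is only a sketch (the OptionOrder class and parseOrdered helper are made up here for illustration, not CoreNLP code):

```java
import java.util.*;

public class OptionOrder {
    // Parse "key=value,key2,..." into a map that preserves the order
    // in which the options were written; bare keys default to "true".
    static LinkedHashMap<String, String> parseOrdered(String options) {
        LinkedHashMap<String, String> map = new LinkedHashMap<>();
        for (String item : options.split(",")) {
            String[] kv = item.split("=", 2);
            map.put(kv[0], kv.length > 1 ? kv[1] : "true");
        }
        return map;
    }

    public static void main(String[] args) {
        String options = "splitHyphenated=true,invertible,ptb3Escaping=true";

        // Properties is a Hashtable, so entrySet() iteration order is
        // unspecified and need not match the order the user wrote.
        Properties prop = new Properties();
        prop.putAll(parseOrdered(options));
        System.out.println("Hashtable order: " + prop.keySet());

        // Iterating the parsed options in insertion order means a later
        // option deterministically overrides an earlier one, on any JVM.
        System.out.println("Insertion order: " + parseOrdered(options).keySet());
    }
}
```

Processing the entries of the LinkedHashMap instead of the raw Properties entry set would make the ptb3Escaping/splitHyphenated interaction depend only on the order the user specified, not on the JVM's hashing.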
I am now certain that it is the key order in the Properties object causing this problem.
While we come up with some sort of fix, in the meantime, you could always set the splitHyphenated property of the Lexer to whatever value you need...
So, to what extent is this an issue where you would need a quick fix, versus being able to work around it (such as by setting the appropriate option in the Lexer after creating it) until the next release is made?
The fix for the tokenizer is now in dev branch. I would like to fix this in the Parser as well, but that requires serializing all the models again. Please leave this open in the meantime!