CoreNLP
ner.applyFineGrained and PERSON entity annotation
When ner.applyFineGrained is set to true, the NER annotator gets confused in some circumstances, for example with the phrase "George Washington went to Washington": in this case the term George does not get any annotation, i.e. it has an O value in the output:
{
"sentences": [{
"index": 0,
"text": "George Washington went to Washington",
"line": 1,
"sentimentValue": "1",
"tokens": [{
"index": 1,
"word": "George",
"characterOffsetBegin": 0,
"characterOffsetEnd": 6,
"before": "",
"after": " ",
"pos": "NNP",
"ner": "O",
"lemma": "George"
},
{
"index": 2,
"word": "Washington",
"characterOffsetBegin": 7,
"characterOffsetEnd": 17,
"before": " ",
"after": " ",
"pos": "NNP",
"ner": "STATE_OR_PROVINCE"
},
{
"index": 3,
"word": "went",
"characterOffsetBegin": 18,
"characterOffsetEnd": 22,
"before": " ",
"after": " ",
"pos": "VBD",
"ner": "O"
},
{
"index": 4,
"word": "to",
"characterOffsetBegin": 23,
"characterOffsetEnd": 25,
"before": " ",
"after": " ",
"pos": "TO",
"ner": "O"
},
{
"index": 5,
"word": "Washington",
"characterOffsetBegin": 26,
"characterOffsetEnd": 36,
"before": " ",
"after": "",
"pos": "NNP",
"ner": "STATE_OR_PROVINCE"
}
]
}]
}
When it is set to false, on the other hand, the annotator correctly detects George as part of a PERSON entity, so the output looks like:
{
"sentences": [{
"index": 0,
"text": "George Washington went to Washington",
"line": 1,
"sentimentValue": "1",
"tokens": [{
"index": 1,
"word": "George",
"characterOffsetBegin": 0,
"characterOffsetEnd": 6,
"before": "",
"after": " ",
"pos": "NNP",
"ner": "PERSON",
"lemma": "George",
"phoneme": "ʤɔˈɹʤ",
},
{
"index": 2,
"word": "Washington",
"characterOffsetBegin": 7,
"characterOffsetEnd": 17,
"before": " ",
"after": " ",
"pos": "NNP",
"ner": "PERSON",
"lemma": "Washington",
},
{
"index": 3,
"word": "went",
"characterOffsetBegin": 18,
"characterOffsetEnd": 22,
"before": " ",
"after": " ",
"pos": "VBD",
"ner": "O",
"lemma": "go"
},
{
"index": 4,
"word": "to",
"characterOffsetBegin": 23,
"characterOffsetEnd": 25,
"before": " ",
"after": " ",
"pos": "TO",
"ner": "O",
"lemma": "to"
},
{
"index": 5,
"word": "Washington",
"characterOffsetBegin": 26,
"characterOffsetEnd": 36,
"before": " ",
"after": "",
"pos": "NNP",
"ner": "LOCATION",
"lemma": "Washington"
}
]
}]
}
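For reference, the two runs can be reproduced with a plain pipeline along these lines (a minimal sketch using the standard CoreDocument API; the class name is just for illustration, and only the ner.applyFineGrained property changes between the two runs):

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import java.util.Properties;

public class FineGrainedToggle {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");
    // flip this between "true" and "false" to compare the two outputs above
    props.setProperty("ner.applyFineGrained", "true");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    CoreDocument doc = new CoreDocument("George Washington went to Washington");
    pipeline.annotate(doc);
    for (CoreLabel token : doc.tokens()) {
      // print each word together with its final NER tag
      System.out.println(token.word() + "\t" + token.ner());
    }
  }
}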
Any reason for this behavior?
I cannot reproduce this error (using 3.9.2 or GitHub latest code). Could you provide more details about the context?
Command I used:
java -Xmx10g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -ner.applyFineGrained -file example.txt -outputFormat text
@J38 thanks a lot for the debugging. I dug a bit into the code, and I realized that this happens in this very specific use case:
- The entity is composed of more than one token (hence George Washington).
- We use ner.applyFineGrained together with our custom annotator, which extends SentenceAnnotator and uses the NERClassifierCombiner to recognize the new entity type ARTIST that we have defined.
In contrast, given the text "George went to Washington, Rihanna is an artist", where each entity is a single token (hence just George), it works as expected: we recognize both the base PERSON entity and our ARTIST entity:
"annotations": {
"sentences": [
{
"index": 0,
"text": "George went to Washington, Rihanna is an artist",
"line": 1,
"structure": "A0",
"paragraphIndex": 0,
"paragraphStructure": "A0",
"tokens": [
{
"index": 1,
"word": "George",
"characterOffsetBegin": 0,
"characterOffsetEnd": 6,
"before": "",
"after": " ",
"pos": "NNP",
"ner": "PERSON",
"lemma": "George",
"snippet": "George went to Washington, Rihanna is an artist",
"entityDelimiter": "U"
},
...
{
"index": 4,
"word": "Washington",
"characterOffsetBegin": 15,
"characterOffsetEnd": 25,
"before": " ",
"after": "",
"pos": "NNP",
"ner": "STATE_OR_PROVINCE",
"lemma": "Washington",
"snippet": "George went to Washington, Rihanna is an artist",
"entityDelimiter": "U"
},
...
{
"index": 6,
"word": "Rihanna",
"characterOffsetBegin": 27,
"characterOffsetEnd": 34,
"before": " ",
"after": " ",
"pos": "NNP",
"ner": "ARTIST",
"lemma": "Rihanna",
"mxmID": "33491890",
"snippet": "George went to Washington, Rihanna is an artist",
"entityDelimiter": "U"
},
...
],
In this case we run with this configuration of ner.fine.regexner.mapping:
"ner.applyFineGrained": true,
"ner.fine.regexner.mapping": "header=true,mxm_nlpdata/mxm_casedentities.tab;ignorecase=true,edu/stanford/nlp/models/kbp/regexner_caseless.tab;edu/stanford/nlp/models/kbp/regexner_cased.tab;ignorecase=true,header=true, mxm_nlpdata/mxm_entities.tab;ignorecase=true,header=true, mxm_nlpdata/mxm_artists.tab;mxm_nlpdata/mxm_labels.tab;ignorecase=true, mxm_nlpdata/mxm_blacklist.tab"
So it seems that our custom SentenceAnnotator fails when it overrides the annotate method:
@Override
public void annotate(Annotation annotation) {
if (VERBOSE) {
log.info("Adding NER Combiner annotation ... ");
}
// if ner.usePresentDateForDocDate is set, use the present date as the doc date
if (usePresentDateForDocDate) {
String currentDate =
new SimpleDateFormat("yyyy-MM-dd").format(Calendar.getInstance().getTime());
annotation.set(CoreAnnotations.DocDateAnnotation.class, currentDate);
}
// use provided doc date if applicable
if (!providedDocDate.equals("")) {
annotation.set(CoreAnnotations.DocDateAnnotation.class, providedDocDate);
}
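// AnnotationsMask is a project-specific helper (not part of CoreNLP); presumably
// decompose() builds a masked view that the stock NER machinery runs on, and
// recompose() at the end merges the results back into the original annotation.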
AnnotationsMask mask = new AnnotationsMask(true);
Annotation maskedAnnotation = mask.decompose(annotation);
super.annotate(maskedAnnotation);
this.ner.finalizeAnnotation(maskedAnnotation);
if (VERBOSE) {
log.info("done.");
}
// if Spanish, run the regexner with Spanish number rules
if (LanguageInfo.HumanLanguage.SPANISH.equals(language))
spanishNumberAnnotator.annotate(maskedAnnotation);
// if fine grained ner is requested, run that
if (this.applyFineGrained) {
fineGrainedNERAnnotator.annotate(maskedAnnotation);
// set the FineGrainedNamedEntityTagAnnotation.class
for (CoreLabel token : maskedAnnotation.get(CoreAnnotations.TokensAnnotation.class)) {
String fineGrainedTag = token.get(CoreAnnotations.NamedEntityTagAnnotation.class);
token.set(CoreAnnotations.FineGrainedNamedEntityTagAnnotation.class, fineGrainedTag);
}
}
// if entity mentions should be built, run that
if (this.buildEntityMentions)
entityMentionsAnnotator.annotate(maskedAnnotation);
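// Per-key default values handed to the custom recompose() step below (what
// recompose() does with them is project-specific, so this description is an assumption).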
Map<Class, Object> mapped_defaults = new HashMap<>();
mapped_defaults.put(CoreAnnotations.NamedEntityTagAnnotation.class, "O");
mapped_defaults.put(CoreAnnotations.NormalizedNamedEntityTagAnnotation.class, null);
mapped_defaults.put(MXMCoreAnnotations.MXMSlangCorrectionAnnotation.class, null);
mapped_defaults.put(MXMCoreAnnotations.MXMEntityID.class, null);
mapped_defaults.put(CoreAnnotations.LinkAnnotation.class, null);
mapped_defaults.put(CoreAnnotations.ValueAnnotation.class, null);
mapped_defaults.put(TimeExpression.Annotation.class, null);
mapped_defaults.put(TimeExpression.TimeIndexAnnotation.class, null);
mapped_defaults.put(CoreAnnotations.DistSimAnnotation.class, null);
mapped_defaults.put(CoreAnnotations.NumericCompositeTypeAnnotation.class, null);
mapped_defaults.put(TimeExpression.ChildrenAnnotation.class, null);
mapped_defaults.put(CoreAnnotations.NumericTypeAnnotation.class, null);
mapped_defaults.put(CoreAnnotations.ShapeAnnotation.class, null);
mapped_defaults.put(Tags.TagsAnnotation.class, null);
mapped_defaults.put(CoreAnnotations.NumerizedTokensAnnotation.class, null);
mapped_defaults.put(CoreAnnotations.AnswerAnnotation.class, null);
mapped_defaults.put(CoreAnnotations.NumericCompositeValueAnnotation.class, null);
mapped_defaults.put(CoreAnnotations.CoarseNamedEntityTagAnnotation.class, null);
mapped_defaults.put(CoreAnnotations.FineGrainedNamedEntityTagAnnotation.class, null);
annotation = mask.recompose(annotation, maskedAnnotation, mapped_defaults);
}
Could you show me the pipeline settings? Did you create a statistical model to tag "ARTIST"?
Also for reference, here is the latest write up on the NER process, which is pretty detailed about each step:
https://stanfordnlp.github.io/CoreNLP/ner.html
@J38 yes of course. My configuration looks like this:
var options = {
"lang": "en",
"annotators": "tokenize,mxmssplit,mxmslang,mxmphonetics,mxmsegmenter,mxmpos,mxmlemma,mxmner,mxmsentiment",
// POS
"customAnnotatorClass.mxmpos": "musixmatch_nlp.MXMPartOfSpeechAnnotator",
// LEMMATIZER
"customAnnotatorClass.mxmlemma": "musixmatch_nlp.MXMMorphaAnnotator",
// PHONEMES
"customAnnotatorClass.mxmphonetics": "musixmatch_nlp.MXMPhoneticsAnnotator",
// SEGMENTER
"customAnnotatorClass.mxmsegmenter": "musixmatch_nlp.MXMLyricsSegmenterAnnotator",
// SLANG
"customAnnotatorClass.mxmslang": "musixmatch_nlp.MXMSlangCorrector",
// NER
"customAnnotatorClass.mxmner": "musixmatch_nlp.MXMNERCombinerAnnotator",
// SPLIT
"customAnnotatorClass.mxmssplit": "musixmatch_nlp.MXMWordToSentencesAnnotator",
// SENTIMENT
"customAnnotatorClass.mxmsentiment": "musixmatch_nlp.MXMSentimentTensorflowAnnotator",
"mxmphonetics.ipa_dict": "/root/en_cmuipadict.txt",
"mxmsentiment.model_dir": "/root/blstm_att1530026090",
"mxmslang.language": "en",
"ssplit.newlineIsSentenceBreak": "always",
"ner.applyFineGrained": true,
"ner.buildEntityMentions": false,
"ner.fine.regexner.mapping": "header=true,mxm_nlpdata/mxm_casedentities.tab;ignorecase=true,edu/stanford/nlp/models/kbp/regexner_caseless.tab;edu/stanford/nlp/models/kbp/regexner_cased.tab;ignorecase=true,header=true, mxm_nlpdata/mxm_entities.tab;ignorecase=true,header=true, mxm_nlpdata/mxm_artists.tab;mxm_nlpdata/mxm_labels.tab;ignorecase=true, mxm_nlpdata/mxm_blacklist.tab"
};
We have several class extensions here, but the important part for the NER classifier is the mxmner annotator and its implementation "musixmatch_nlp.MXMNERCombinerAnnotator".
The Java class that implements MXMNERCombinerAnnotator and extends SentenceAnnotator is the one posted above.
Basically it works as intended and tags the new ARTIST entities; it only fails in the case presented above, where the entity spans multiple tokens.
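For context, CoreNLP instantiates each customAnnotatorClass.* entry by reflection: the class has to implement Annotator and typically exposes a (String, Properties) constructor so it can read its own prefixed properties. A minimal skeleton (placeholder logic only, not the actual MXM implementation) looks roughly like this:

import edu.stanford.nlp.ling.CoreAnnotation;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.Annotator;
import java.util.Collections;
import java.util.Properties;
import java.util.Set;

public class MinimalCustomAnnotator implements Annotator {

  public MinimalCustomAnnotator(String name, Properties props) {
    // read "<name>.*" properties here, e.g. props.getProperty(name + ".someOption")
  }

  @Override
  public void annotate(Annotation annotation) {
    // placeholder: a real annotator reads and writes CoreAnnotations on the tokens/sentences
  }

  @Override
  public Set<Class<? extends CoreAnnotation>> requirementsSatisfied() {
    return Collections.emptySet();
  }

  @Override
  public Set<Class<? extends CoreAnnotation>> requires() {
    return Collections.emptySet();
  }
}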
@J38 Any idea why this happens? The annotate override in my Java annotator class is the one shown above.