
ner.applyFineGrained and PERSON entity annotation

Open loretoparisi opened this issue 6 years ago • 5 comments

When ner.applyFineGrained is set to true, the NER annotator gets confused in some circumstances, for example on this phrase:

George Washington went to Washington

In this case the token George gets no annotation, i.e. an O value in the output:

{
	"sentences": [{
				"index": 0,
				"text": "George Washington went to Washington",
				"line": 1,
				"sentimentValue": "1",
				"tokens": [{
						"index": 1,
						"word": "George",
						"characterOffsetBegin": 0,
						"characterOffsetEnd": 6,
						"before": "",
						"after": " ",
						"pos": "NNP",
						"ner": "O",
						"lemma": "George"
					},
					{
						"index": 2,
						"word": "Washington",
						"characterOffsetBegin": 7,
						"characterOffsetEnd": 17,
						"before": " ",
						"after": " ",
						"pos": "NNP",
						"ner": "STATE_OR_PROVINCE"
					},
					{
						"index": 3,
						"word": "went",
						"characterOffsetBegin": 18,
						"characterOffsetEnd": 22,
						"before": " ",
						"after": " ",
						"pos": "VBD",
						"ner": "O"
					},
					{
						"index": 4,
						"word": "to",
						"characterOffsetBegin": 23,
						"characterOffsetEnd": 25,
						"before": " ",
						"after": " ",
						"pos": "TO",
						"ner": "O"
					},
					{
						"index": 5,
						"word": "Washington",
						"characterOffsetBegin": 26,
						"characterOffsetEnd": 36,
						"before": " ",
						"after": "",
						"pos": "NNP",
						"ner": "STATE_OR_PROVINCE"
					}
				]
			}]
}

When it is set to false, the annotator correctly tags George as PERSON, and the output looks like this:

{
	"sentences": [{
		"index": 0,
		"text": "George Washington went to Washington",
		"line": 1,
		"sentimentValue": "1",
		"tokens": [{
				"index": 1,
				"word": "George",
				"characterOffsetBegin": 0,
				"characterOffsetEnd": 6,
				"before": "",
				"after": " ",
				"pos": "NNP",
				"ner": "PERSON",
				"lemma": "George",
				"phoneme": "ʤɔˈɹʤ"
			},
			{
				"index": 2,
				"word": "Washington",
				"characterOffsetBegin": 7,
				"characterOffsetEnd": 17,
				"before": " ",
				"after": " ",
				"pos": "NNP",
				"ner": "PERSON",
				"lemma": "Washington"
			},
			{
				"index": 3,
				"word": "went",
				"characterOffsetBegin": 18,
				"characterOffsetEnd": 22,
				"before": " ",
				"after": " ",
				"pos": "VBD",
				"ner": "O",
				"lemma": "go"
			},
			{
				"index": 4,
				"word": "to",
				"characterOffsetBegin": 23,
				"characterOffsetEnd": 25,
				"before": " ",
				"after": " ",
				"pos": "TO",
				"ner": "O",
				"lemma": "to"
			},
			{
				"index": 5,
				"word": "Washington",
				"characterOffsetBegin": 26,
				"characterOffsetEnd": 36,
				"before": " ",
				"after": "",
				"pos": "NNP",
				"ner": "LOCATION",
				"lemma": "Washington"
			}
		]
	}]
}
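The difference between the two outputs above can be made explicit by comparing the "ner" values token by token. The following is a minimal sketch in plain Java; the class and method names (NerTagDiff, diff) are hypothetical helpers for illustration, not part of CoreNLP, and the tag values are copied by hand from the two JSON outputs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: compares the per-token "ner" values of two runs over the same tokens. */
final class NerTagDiff {

    /** Returns one line per token whose tag differs between the two runs. */
    static List<String> diff(List<String> words, List<String> tagsA, List<String> tagsB) {
        if (words.size() != tagsA.size() || words.size() != tagsB.size()) {
            throw new IllegalArgumentException("token and tag lists must have the same length");
        }
        List<String> changes = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            if (!tagsA.get(i).equals(tagsB.get(i))) {
                // Token indices in the JSON output are 1-based, so report i + 1.
                changes.add((i + 1) + " " + words.get(i) + ": " + tagsA.get(i) + " -> " + tagsB.get(i));
            }
        }
        return changes;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("George", "Washington", "went", "to", "Washington");
        // Tags with ner.applyFineGrained=false, copied from the second output.
        List<String> coarseOnly = Arrays.asList("PERSON", "PERSON", "O", "O", "LOCATION");
        // Tags with ner.applyFineGrained=true, copied from the first output.
        List<String> fineGrained = Arrays.asList("O", "STATE_OR_PROVINCE", "O", "O", "STATE_OR_PROVINCE");
        diff(words, coarseOnly, fineGrained).forEach(System.out::println);
    }
}
```

Run on the values above, this reports three changed tokens: token 1 (PERSON to O), token 2 (PERSON to STATE_OR_PROVINCE), and token 5 (LOCATION to STATE_OR_PROVINCE). Only the first of these loses its annotation entirely.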

Any reason for this behavior?

loretoparisi, Jan 29 '19 10:01

I cannot reproduce this error (using 3.9.2 or GitHub latest code). Could you provide more details about the context?

Command I used:

java -Xmx10g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -ner.applyFineGrained -file example.txt -outputFormat text

J38, Feb 07 '19 08:02

@J38 thanks a lot for the debugging. I dug into the code a bit and realized that this happens in this very specific use case:

  1. The entity is composed of more than one token (here, George Washington).
  2. We use ner.applyFineGrained with our custom annotator, which extends SentenceAnnotator and uses NERClassifierCombiner to recognize the new entity type ARTIST that we have defined.
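The first condition can be checked mechanically on a sequence of tags. The following is a minimal sketch in plain Java, assuming that consecutive tokens sharing the same non-O tag belong to one entity (two adjacent entities of the same type would be merged by this rule). EntitySpans and multiTokenSpans are hypothetical names, not CoreNLP API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: finds runs of two or more consecutive tokens that share a non-O tag. */
final class EntitySpans {

    /** Returns "start-end TAG" entries, with 0-based start and exclusive end. */
    static List<String> multiTokenSpans(List<String> tags) {
        List<String> spans = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= tags.size(); i++) {
            // A run ends at the end of the list or when the tag changes.
            boolean runEnds = i == tags.size() || !tags.get(i).equals(tags.get(start));
            if (runEnds) {
                if (i - start > 1 && !"O".equals(tags.get(start))) {
                    spans.add(start + "-" + i + " " + tags.get(start));
                }
                start = i;
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        // Coarse tags for "George Washington went to Washington" from the earlier output.
        List<String> tags = Arrays.asList("PERSON", "PERSON", "O", "O", "LOCATION");
        System.out.println(multiTokenSpans(tags));
    }
}
```

For the coarse tags of "George Washington went to Washington" this yields a single span covering the first two tokens, which is the kind of input that triggers the problem described here.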

By contrast, given the text George went to Washington, Rihanna is an artist, where the entity is a single token (George), it works as expected: we recognize both the base PERSON entity and our ARTIST entity:

"annotations": {
    "sentences": [
      {
        "index": 0,
        "text": "George went to Washington, Rihanna is an artist",
        "line": 1,
        "structure": "A0",
        "paragraphIndex": 0,
        "paragraphStructure": "A0",
        "tokens": [
          {
            "index": 1,
            "word": "George",
            "characterOffsetBegin": 0,
            "characterOffsetEnd": 6,
            "before": "",
            "after": " ",
            "pos": "NNP",
            "ner": "PERSON",
            "lemma": "George",
            "snippet": "George went to Washington, Rihanna is an artist",
            "entityDelimiter": "U"
          },
          ...
          {
            "index": 4,
            "word": "Washington",
            "characterOffsetBegin": 15,
            "characterOffsetEnd": 25,
            "before": " ",
            "after": "",
            "pos": "NNP",
            "ner": "STATE_OR_PROVINCE",
            "lemma": "Washington",
            "snippet": "George went to Washington, Rihanna is an artist",
            "entityDelimiter": "U"
          },
          ...
          {
            "index": 6,
            "word": "Rihanna",
            "characterOffsetBegin": 27,
            "characterOffsetEnd": 34,
            "before": " ",
            "after": " ",
            "pos": "NNP",
            "ner": "ARTIST",
            "lemma": "Rihanna",
            "mxmID": "33491890",
            "snippet": "George went to Washington, Rihanna is an artist",
            "entityDelimiter": "U"
          },
...
    ],

In this case we run this configuration of ner.fine.regexner.mapping:

       "ner.applyFineGrained": true,
        "ner.fine.regexner.mapping": "header=true,mxm_nlpdata/mxm_casedentities.tab;ignorecase=true,edu/stanford/nlp/models/kbp/regexner_caseless.tab;edu/stanford/nlp/models/kbp/regexner_cased.tab;ignorecase=true,header=true, mxm_nlpdata/mxm_entities.tab;ignorecase=true,header=true, mxm_nlpdata/mxm_artists.tab;mxm_nlpdata/mxm_labels.tab;ignorecase=true, mxm_nlpdata/mxm_blacklist.tab"  
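Since the mapping string is long, it can help to list its entries one per line when reviewing which files and options are in effect. The following is a minimal sketch in plain Java, assuming the string reads as semicolon-separated entries, each made of comma-separated key=value options followed by a file path. RegexNerMappingSketch is a hypothetical helper and does not reproduce CoreNLP's own parsing; in particular, trimming the spaces that follow some commas is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: splits a regexner mapping string into entries for inspection. */
final class RegexNerMappingSketch {

    /** One mapping entry: its key=value options and its file path. */
    static final class Entry {
        final Map<String, String> options = new LinkedHashMap<>();
        String path = "";

        @Override
        public String toString() {
            return path + " " + options;
        }
    }

    static List<Entry> parse(String mapping) {
        List<Entry> entries = new ArrayList<>();
        for (String rawEntry : mapping.split(";")) {
            if (rawEntry.trim().isEmpty()) {
                continue; // skip empty segments such as a trailing ';'
            }
            Entry entry = new Entry();
            for (String rawPart : rawEntry.split(",")) {
                String part = rawPart.trim(); // assumption: surrounding spaces are not significant
                int eq = part.indexOf('=');
                if (eq > 0) {
                    entry.options.put(part.substring(0, eq), part.substring(eq + 1));
                } else if (!part.isEmpty()) {
                    entry.path = part;
                }
            }
            entries.add(entry);
        }
        return entries;
    }

    public static void main(String[] args) {
        // A shortened version of the mapping from the configuration above.
        String mapping = "header=true,mxm_nlpdata/mxm_casedentities.tab;"
                + "edu/stanford/nlp/models/kbp/regexner_cased.tab;"
                + "ignorecase=true,header=true, mxm_nlpdata/mxm_artists.tab";
        parse(mapping).forEach(System.out::println);
    }
}
```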

So it seems that our custom SentenceAnnotator fails in its overridden annotate method:

@Override
	public void annotate(Annotation annotation) {
		if (VERBOSE) {
			log.info("Adding NER Combiner annotation ... ");
		}

		// if ner.usePresentDateForDocDate is set, use the present date as the doc date
		if (usePresentDateForDocDate) {
			String currentDate =
					new SimpleDateFormat("yyyy-MM-dd").format(Calendar.getInstance().getTime());
			annotation.set(CoreAnnotations.DocDateAnnotation.class, currentDate);
		}
		// use provided doc date if applicable
		if (!providedDocDate.equals("")) {
			annotation.set(CoreAnnotations.DocDateAnnotation.class, providedDocDate);
		}
		
		
		
		AnnotationsMask mask = new AnnotationsMask(true);

		Annotation maskedAnnotation = mask.decompose(annotation);
		

		super.annotate(maskedAnnotation);
		this.ner.finalizeAnnotation(maskedAnnotation);

		if (VERBOSE) {
			log.info("done.");
		}
		// if Spanish, run the regexner with Spanish number rules
		if (LanguageInfo.HumanLanguage.SPANISH.equals(language))
			spanishNumberAnnotator.annotate(maskedAnnotation);
		// if fine grained ner is requested, run that
		if (this.applyFineGrained) {
			fineGrainedNERAnnotator.annotate(maskedAnnotation);
			// set the FineGrainedNamedEntityTagAnnotation.class
			for (CoreLabel token : maskedAnnotation.get(CoreAnnotations.TokensAnnotation.class)) {
				String fineGrainedTag = token.get(CoreAnnotations.NamedEntityTagAnnotation.class);
				token.set(CoreAnnotations.FineGrainedNamedEntityTagAnnotation.class, fineGrainedTag);
			}
		}
		// if entity mentions should be built, run that
		if (this.buildEntityMentions)
			entityMentionsAnnotator.annotate(maskedAnnotation);
		
		
		Map<Class, Object> mapped_defaults = new HashMap<>();

		mapped_defaults.put(CoreAnnotations.NamedEntityTagAnnotation.class, "O");
		mapped_defaults.put(CoreAnnotations.NormalizedNamedEntityTagAnnotation.class, null);
		mapped_defaults.put(MXMCoreAnnotations.MXMSlangCorrectionAnnotation.class, null);
		mapped_defaults.put(MXMCoreAnnotations.MXMEntityID.class, null);
		mapped_defaults.put(CoreAnnotations.LinkAnnotation.class, null);
		mapped_defaults.put(CoreAnnotations.ValueAnnotation.class, null);
		mapped_defaults.put(TimeExpression.Annotation.class, null);
		mapped_defaults.put(TimeExpression.TimeIndexAnnotation.class, null);
		mapped_defaults.put(CoreAnnotations.DistSimAnnotation.class, null);
		mapped_defaults.put(CoreAnnotations.NumericCompositeTypeAnnotation.class, null);
		mapped_defaults.put(TimeExpression.ChildrenAnnotation.class, null);
		mapped_defaults.put(CoreAnnotations.NumericTypeAnnotation.class, null);
		mapped_defaults.put(CoreAnnotations.ShapeAnnotation.class, null);
		mapped_defaults.put(Tags.TagsAnnotation.class, null);
		mapped_defaults.put(CoreAnnotations.NumerizedTokensAnnotation.class, null);
		mapped_defaults.put(CoreAnnotations.AnswerAnnotation.class, null);
		mapped_defaults.put(CoreAnnotations.NumericCompositeValueAnnotation.class, null);
		mapped_defaults.put(CoreAnnotations.CoarseNamedEntityTagAnnotation.class, null);
		mapped_defaults.put(CoreAnnotations.FineGrainedNamedEntityTagAnnotation.class, null);

		
		annotation = mask.recompose(annotation, maskedAnnotation, mapped_defaults);

	}
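One detail in the method above that may be worth ruling out: its last statement assigns the result of mask.recompose to the parameter annotation. In Java, reassigning a parameter only rebinds the local variable, so the caller keeps its original object. Whether that matters here depends on whether recompose also updates the original annotation in place, which the thread does not show. The following plain-Java sketch illustrates only the language behavior, using lists of tags as a stand-in; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrates that reassigning a method parameter is invisible to the caller. */
final class ParameterReassignment {

    /** Rebinds the local parameter to a new list; the caller's list is untouched. */
    static void reassign(List<String> tags) {
        List<String> rebuilt = new ArrayList<>(tags);
        rebuilt.set(0, "PERSON");
        tags = rebuilt; // only the local variable changes
    }

    /** Mutates the object the caller passed in; the caller sees the change. */
    static void mutate(List<String> tags) {
        tags.set(0, "PERSON");
    }

    public static void main(String[] args) {
        List<String> tags = new ArrayList<>(Arrays.asList("O", "STATE_OR_PROVINCE"));

        reassign(tags);
        System.out.println(tags); // [O, STATE_OR_PROVINCE]

        mutate(tags);
        System.out.println(tags); // [PERSON, STATE_OR_PROVINCE]
    }
}
```

If recompose returns a new object without touching the original, the pipeline would continue with the original annotation; if it writes into the original, the assignment is simply redundant.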

loretoparisi, Feb 11 '19 11:02

Could you show me the pipeline settings? Did you create a statistical model to tag "ARTIST"?

Also for reference, here is the latest write-up on the NER process, which is pretty detailed about each step:

https://stanfordnlp.github.io/CoreNLP/ner.html

J38, Feb 13 '19 09:02

@J38 yes of course. My configuration looks like this:

var options = {

        "lang": "en",
        
        "annotators": "tokenize,mxmssplit,mxmslang,mxmphonetics,mxmsegmenter,mxmpos,mxmlemma,mxmner,mxmsentiment",
        
        // POS
        "customAnnotatorClass.mxmpos": "musixmatch_nlp.MXMPartOfSpeechAnnotator",

        // LEMMATIZER
        "customAnnotatorClass.mxmlemma": "musixmatch_nlp.MXMMorphaAnnotator",

        // PHONEMES
        "customAnnotatorClass.mxmphonetics": "musixmatch_nlp.MXMPhoneticsAnnotator",

        // SEGMENTER
        "customAnnotatorClass.mxmsegmenter": "musixmatch_nlp.MXMLyricsSegmenterAnnotator",

        // SLANG
        "customAnnotatorClass.mxmslang": "musixmatch_nlp.MXMSlangCorrector",

        // NER
        "customAnnotatorClass.mxmner": "musixmatch_nlp.MXMNERCombinerAnnotator",

        // SPLIT
        "customAnnotatorClass.mxmssplit": "musixmatch_nlp.MXMWordToSentencesAnnotator",

        // SENTIMENT
        "customAnnotatorClass.mxmsentiment": "musixmatch_nlp.MXMSentimentTensorflowAnnotator",

        "mxmphonetics.ipa_dict": "/root/en_cmuipadict.txt",
        "mxmsentiment.model_dir": "/root/blstm_att1530026090",
        "mxmslang.language": "en",
        "ssplit.newlineIsSentenceBreak": "always",
        
        "ner.applyFineGrained": true,
        "ner.buildEntityMentions": false,
        
        "ner.fine.regexner.mapping": "header=true,mxm_nlpdata/mxm_casedentities.tab;ignorecase=true,edu/stanford/nlp/models/kbp/regexner_caseless.tab;edu/stanford/nlp/models/kbp/regexner_cased.tab;ignorecase=true,header=true, mxm_nlpdata/mxm_entities.tab;ignorecase=true,header=true, mxm_nlpdata/mxm_artists.tab;mxm_nlpdata/mxm_labels.tab;ignorecase=true, mxm_nlpdata/mxm_blacklist.tab"
        
        
    
    };

We have several class extensions here, but the part relevant to the NER classifier is mxmner and its class, "musixmatch_nlp.MXMNERCombinerAnnotator". The annotate method of MXMNERCombinerAnnotator, which extends SentenceAnnotator, is shown above. It normally works and assigns the new ARTIST tag; it fails in the case presented above, when the entity has multiple tokens.
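To narrow down where the PERSON tag on George is lost, one option is to snapshot the tags after each step of annotate and find the first step at which a given token changes. The following is a minimal sketch in plain Java; TagStageTrace is a hypothetical helper, the stage names are labels chosen for illustration, and the tag values in main are made up to show the mechanics. Real values would be read from each token's NamedEntityTagAnnotation at the corresponding point in the method.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

/** Hypothetical helper: records tag snapshots per stage and locates the first change for a token. */
final class TagStageTrace {
    // LinkedHashMap keeps the stages in the order they were recorded.
    private final Map<String, List<String>> snapshots = new LinkedHashMap<>();

    /** Stores a copy of the tags as they are after the named stage. */
    void record(String stage, List<String> tags) {
        snapshots.put(stage, new ArrayList<>(tags));
    }

    /** Returns the first stage whose tag for the token differs from the previous snapshot, or null. */
    String firstChange(int tokenIndex) {
        String previous = null;
        boolean first = true;
        for (Map.Entry<String, List<String>> snapshot : snapshots.entrySet()) {
            String current = snapshot.getValue().get(tokenIndex);
            // Objects.equals tolerates null tags, which a token may have before NER runs.
            if (!first && !Objects.equals(previous, current)) {
                return snapshot.getKey();
            }
            previous = current;
            first = false;
        }
        return null;
    }

    public static void main(String[] args) {
        TagStageTrace trace = new TagStageTrace();
        // Made-up values for illustration only, not taken from a real run.
        trace.record("after super.annotate", Arrays.asList("PERSON", "PERSON"));
        trace.record("after fineGrainedNERAnnotator", Arrays.asList("PERSON", "STATE_OR_PROVINCE"));
        trace.record("after recompose", Arrays.asList("O", "STATE_OR_PROVINCE"));
        System.out.println("token 0 first changed: " + trace.firstChange(0));
        System.out.println("token 1 first changed: " + trace.firstChange(1));
    }
}
```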

loretoparisi, Feb 13 '19 17:02

@J38 Any idea why this happens? My annotate override in the Java annotator class is the same one posted in my Feb 11 comment above.

loretoparisi, Feb 25 '19 14:02