Does the location affect the NER result, or does the NER analysis have a bug?
Refer to the test results below: the named entity recognition results seem a little weird or some...
text: "中国中央广播电台中央人民北京上海广州深圳美国人民"

|word|ner|
|---|---|
|中国|ORGANIZATION|
|中央|ORGANIZATION|
|广播|ORGANIZATION|
|电台|ORGANIZATION|
|中央|FACILITY|
|人民|FACILITY|
|北京|FACILITY|
|上海|FACILITY|
|广州|FACILITY|
|深圳|FACILITY|
|美国|FACILITY|
|人民|FACILITY|


text: "CHINA,USA,U.S.A.,Ireland,JAPAN,Japan,Japanese,California"

|word|ner|
|---|---|
|china|ORGANIZATION|
|USA|O|
|U.S.A.|O|
|Ireland|O|
|JAPAN|O|
|Japan|O|
|Japanese|O|
|California|O|

Is this a long list of people and places? The NER expects context in a sentence, not a list of names, so it will not function well on an example like this.
Actually, I've tried whole sentences, but most of the NER results come back as "O". Also, some text such as [Jan. 23.] is split into ["Jan."] and ["23."], and both tokens are marked as "O" as well.
ref: text: General Mills announced a voluntary recall on Jan. 23. of their Gold Medal Unbleached Flour, due to the detection of Salmonella during routine sampling. The company is urging consumers to check their pantries for the presence of 5-pound bags of Gold Medal Unbleached Flour with a “better if used by date of April 20, 2020.” Hello World. I have $1.6 billion. Do you know the result of 2.3 * 1.5 .Sen.Lindsey O.Graham (R-S.C.) said that the "U.S. does not recognize the Maduro regime as the government of Venezuela"
{ "token": "General", "start_offset": 0, "end_offset": 7, "type": "{"lemma":"general","ner":"TITLE","tag":"NN"}", "position": 0 }, { "token": "Mills", "start_offset": 8, "end_offset": 13, "type": "{"lemma":"mills","ner":"O","tag":"NN"}", "position": 1 }, { "token": "announced", "start_offset": 14, "end_offset": 23, "type": "{"lemma":"announce","ner":"O","tag":"VV"}", "position": 2 }, { "token": "a", "start_offset": 24, "end_offset": 25, "type": "{"lemma":"a","ner":"NUMBER","tag":"CD"}", "position": 3 }, { "token": "voluntary", "start_offset": 26, "end_offset": 35, "type": "{"lemma":"voluntary","ner":"O","tag":"M"}", "position": 4 }, { "token": "recall", "start_offset": 36, "end_offset": 42, "type": "{"lemma":"recall","ner":"O","tag":"NN"}", "position": 5 }, { "token": "on", "start_offset": 43, "end_offset": 45, "type": "{"lemma":"on","ner":"O","tag":"P"}", "position": 6 }, { "token": "Jan.", "start_offset": 46, "end_offset": 50, "type": "{"lemma":"jan.","ner":"O","tag":"NR"}", "position": 7 }, { "token": "23.", "start_offset": 51, "end_offset": 54, "type": "{"lemma":"23.","ner":"O","tag":"NN"}", "position": 8 }, { "token": "of", "start_offset": 55, "end_offset": 57, "type": "{"lemma":"of","ner":"O","tag":"P"}", "position": 9 }, { "token": "their", "start_offset": 58, "end_offset": 63, "type": "{"lemma":"their","ner":"O","tag":"NN"}", "position": 10 }, { "token": "Gold", "start_offset": 64, "end_offset": 68, "type": "{"lemma":"gold","ner":"O","tag":"NN"}", "position": 11 }, { "token": "Medal", "start_offset": 69, "end_offset": 74, "type": "{"lemma":"medal","ner":"O","tag":"JJ"}", "position": 12 }, { "token": "Unbleached", "start_offset": 75, "end_offset": 85, "type": "{"lemma":"unbleached","ner":"O","tag":"NN"}", "position": 13 }, { "token": "Flour", "start_offset": 86, "end_offset": 91, "type": "{"lemma":"flour","ner":"O","tag":"NN"}", "position": 14 }, { "token": "due", "start_offset": 93, "end_offset": 96, "type": "{"lemma":"due","ner":"O","tag":"VV"}", 
"position": 15 }, { "token": "to", "start_offset": 97, "end_offset": 99, "type": "{"lemma":"to","ner":"O","tag":"P"}", "position": 16 }, { "token": "the", "start_offset": 100, "end_offset": 103, "type": "{"lemma":"the","ner":"O","tag":"DT"}", "position": 17 }, { "token": "detection", "start_offset": 104, "end_offset": 113, "type": "{"lemma":"detection","ner":"O","tag":"NN"}", "position": 18 }, { "token": "of", "start_offset": 114, "end_offset": 116, "type": "{"lemma":"of","ner":"O","tag":"P"}", "position": 19 }, { "token": "Salmonella", "start_offset": 117, "end_offset": 127, "type": "{"lemma":"salmonella","ner":"O","tag":"NR"}", "position": 20 }, { "token": "during", "start_offset": 128, "end_offset": 134, "type": "{"lemma":"during","ner":"O","tag":"JJ"}", "position": 21 }, { "token": "routine", "start_offset": 135, "end_offset": 142, "type": "{"lemma":"routine","ner":"O","tag":"NN"}", "position": 22 }, { "token": "sampling.", "start_offset": 143, "end_offset": 152, "type": "{"lemma":"sampling.","ner":"O","tag":"NN"}", "position": 23 }, { "token": "The", "start_offset": 153, "end_offset": 156, "type": "{"lemma":"the","ner":"O","tag":"DT"}", "position": 24 }, { "token": "company", "start_offset": 157, "end_offset": 164, "type": "{"lemma":"company","ner":"O","tag":"NN"}", "position": 25 }, { "token": "is", "start_offset": 165, "end_offset": 167, "type": "{"lemma":"be","ner":"O","tag":"VC"}", "position": 26 }, { "token": "urging", "start_offset": 168, "end_offset": 174, "type": "{"lemma":"urging","ner":"O","tag":"JJ"}", "position": 27 }, { "token": "consumers", "start_offset": 175, "end_offset": 184, "type": "{"lemma":"consumers","ner":"O","tag":"NN"}", "position": 28 }, { "token": "to", "start_offset": 185, "end_offset": 187, "type": "{"lemma":"to","ner":"O","tag":"P"}", "position": 29 }, { "token": "check", "start_offset": 188, "end_offset": 193, "type": "{"lemma":"check","ner":"O","tag":"NR"}", "position": 30 }, { "token": "their", "start_offset": 194, 
"end_offset": 199, "type": "{"lemma":"their","ner":"O","tag":"NN"}", "position": 31 }, { "token": "pantries", "start_offset": 200, "end_offset": 208, "type": "{"lemma":"pantries","ner":"O","tag":"NN"}", "position": 32 }, { "token": "for", "start_offset": 209, "end_offset": 212, "type": "{"lemma":"for","ner":"O","tag":"P"}", "position": 33 }, { "token": "the", "start_offset": 213, "end_offset": 216, "type": "{"lemma":"the","ner":"O","tag":"DT"}", "position": 34 }, { "token": "presence", "start_offset": 217, "end_offset": 225, "type": "{"lemma":"presence","ner":"O","tag":"NN"}", "position": 35 }, { "token": "of", "start_offset": 226, "end_offset": 228, "type": "{"lemma":"of","ner":"O","tag":"P"}", "position": 36 }, { "token": "5-pound", "start_offset": 229, "end_offset": 236, "type": "{"lemma":"5-pound","ner":"O","tag":"NN"}", "position": 37 }, { "token": "bags", "start_offset": 237, "end_offset": 241, "type": "{"lemma":"bags","ner":"O","tag":"NN"}", "position": 38 }, { "token": "of", "start_offset": 242, "end_offset": 244, "type": "{"lemma":"of","ner":"O","tag":"P"}", "position": 39 }, { "token": "Gold", "start_offset": 245, "end_offset": 249, "type": "{"lemma":"gold","ner":"O","tag":"NR"}", "position": 40 }, { "token": "Medal", "start_offset": 250, "end_offset": 255, "type": "{"lemma":"medal","ner":"O","tag":"JJ"}", "position": 41 }, { "token": "Unbleached", "start_offset": 256, "end_offset": 266, "type": "{"lemma":"unbleached","ner":"O","tag":"NN"}", "position": 42 }, { "token": "Flour", "start_offset": 267, "end_offset": 272, "type": "{"lemma":"flour","ner":"O","tag":"NN"}", "position": 43 }, { "token": "with", "start_offset": 273, "end_offset": 277, "type": "{"lemma":"with","ner":"O","tag":"P"}", "position": 44 }, { "token": "a", "start_offset": 278, "end_offset": 279, "type": "{"lemma":"a","ner":"NUMBER","tag":"CD"}", "position": 45 }, { "token": "better", "start_offset": 281, "end_offset": 287, "type": "{"lemma":"better","ner":"O","tag":"NN"}", "position": 46 
}, { "token": "if", "start_offset": 288, "end_offset": 290, "type": "{"lemma":"if","ner":"O","tag":"NN"}", "position": 47 }, { "token": "used", "start_offset": 291, "end_offset": 295, "type": "{"lemma":"use","ner":"O","tag":"VV"}", "position": 48 }, { "token": "by", "start_offset": 296, "end_offset": 298, "type": "{"lemma":"by","ner":"O","tag":"P"}", "position": 49 }, { "token": "date", "start_offset": 299, "end_offset": 303, "type": "{"lemma":"date","ner":"MISC","tag":"NR"}", "position": 50 }, { "token": "of", "start_offset": 304, "end_offset": 306, "type": "{"lemma":"of","ner":"MISC","tag":"P"}", "position": 51 }, { "token": "April", "start_offset": 307, "end_offset": 312, "type": "{"lemma":"april","ner":"MISC","tag":"NR"}", "position": 52 }, { "token": "20", "start_offset": 313, "end_offset": 315, "type": "{"lemma":"20","ner":"NUMBER","tag":"CD"}", "position": 53 }, { "token": "2020", "start_offset": 317, "end_offset": 321, "type": "{"lemma":"2020","ner":"NUMBER","tag":"CD"}", "position": 54 }, { "token": ".", "start_offset": 321, "end_offset": 322, "type": "{"lemma":".","ner":"O","tag":"PU"}", "position": 55 }, { "token": "Hello", "start_offset": 324, "end_offset": 329, "type": "{"lemma":"hello","ner":"O","tag":"NR"}", "position": 56 }, { "token": "World", "start_offset": 330, "end_offset": 335, "type": "{"lemma":"world","ner":"O","tag":"NR"}", "position": 57 }, { "token": ".", "start_offset": 335, "end_offset": 336, "type": "{"lemma":".","ner":"O","tag":"PU"}", "position": 58 }, { "token": "I", "start_offset": 337, "end_offset": 338, "type": "{"lemma":"I","ner":"O","tag":"PN"}", "position": 59 }, { "token": "have", "start_offset": 339, "end_offset": 343, "type": "{"lemma":"have","ner":"O","tag":"VV"}", "position": 60 }, { "token": "$1.6", "start_offset": 344, "end_offset": 348, "type": "{"lemma":"$1.6","ner":"O","tag":"JJ"}", "position": 61 }, { "token": "billion", "start_offset": 349, "end_offset": 356, "type": "{"lemma":"billion","ner":"O","tag":"NN"}", 
"position": 62 }, { "token": ".", "start_offset": 356, "end_offset": 357, "type": "{"lemma":".","ner":"O","tag":"PU"}", "position": 63 }, { "token": "Do", "start_offset": 358, "end_offset": 360, "type": "{"lemma":"do","ner":"O","tag":"NN"}", "position": 64 }, { "token": "you", "start_offset": 361, "end_offset": 364, "type": "{"lemma":"you","ner":"O","tag":"PN"}", "position": 65 }, { "token": "know", "start_offset": 365, "end_offset": 369, "type": "{"lemma":"know","ner":"O","tag":"VV"}", "position": 66 }, { "token": "the", "start_offset": 370, "end_offset": 373, "type": "{"lemma":"the","ner":"O","tag":"DT"}", "position": 67 }, { "token": "result", "start_offset": 374, "end_offset": 380, "type": "{"lemma":"result","ner":"O","tag":"NN"}", "position": 68 }, { "token": "of", "start_offset": 381, "end_offset": 383, "type": "{"lemma":"of","ner":"O","tag":"P"}", "position": 69 }, { "token": "2.3", "start_offset": 384, "end_offset": 387, "type": "{"lemma":"2.3","ner":"NUMBER","tag":"CD"}", "position": 70 }, { "token": "1.5", "start_offset": 390, "end_offset": 393, "type": "{"lemma":"1.5","ner":"NUMBER","tag":"CD"}", "position": 71 }, { "token": ".Sen.Lindsey", "start_offset": 394, "end_offset": 406, "type": "{"lemma":".sen.lindsey","ner":"O","tag":"M"}", "position": 72 }, { "token": "O.Graham", "start_offset": 407, "end_offset": 415, "type": "{"lemma":"o.graham","ner":"O","tag":"NN"}", "position": 73 }, { "token": "R-S.C.", "start_offset": 417, "end_offset": 423, "type": "{"lemma":"r-s.c.","ner":"O","tag":"NR"}", "position": 74 }, { "token": "said", "start_offset": 425, "end_offset": 429, "type": "{"lemma":"said","ner":"O","tag":"NN"}", "position": 75 }, { "token": "that", "start_offset": 430, "end_offset": 434, "type": "{"lemma":"that","ner":"O","tag":"CS"}", "position": 76 }, { "token": "the", "start_offset": 435, "end_offset": 438, "type": "{"lemma":"the","ner":"O","tag":"DT"}", "position": 77 }, { "token": "U.S.", "start_offset": 440, "end_offset": 444, "type": 
"{"lemma":"u.s.","ner":"COUNTRY","tag":"JJ"}", "position": 78 }, { "token": "does", "start_offset": 445, "end_offset": 449, "type": "{"lemma":"does","ner":"O","tag":"NN"}", "position": 79 }, { "token": "not", "start_offset": 450, "end_offset": 453, "type": "{"lemma":"not","ner":"O","tag":"AD"}", "position": 80 }, { "token": "recognize", "start_offset": 454, "end_offset": 463, "type": "{"lemma":"recognize","ner":"O","tag":"VV"}", "position": 81 }, { "token": "the", "start_offset": 464, "end_offset": 467, "type": "{"lemma":"the","ner":"O","tag":"DT"}", "position": 82 }, { "token": "Maduro", "start_offset": 468, "end_offset": 474, "type": "{"lemma":"maduro","ner":"O","tag":"JJ"}", "position": 83 }, { "token": "regime", "start_offset": 475, "end_offset": 481, "type": "{"lemma":"regime","ner":"O","tag":"NN"}", "position": 84 }, { "token": "as", "start_offset": 482, "end_offset": 484, "type": "{"lemma":"as","ner":"O","tag":"NN"}", "position": 85 }, { "token": "the", "start_offset": 485, "end_offset": 488, "type": "{"lemma":"the","ner":"O","tag":"DT"}", "position": 86 }, { "token": "government", "start_offset": 489, "end_offset": 499, "type": "{"lemma":"government","ner":"O","tag":"NN"}", "position": 87 }, { "token": "of", "start_offset": 500, "end_offset": 502, "type": "{"lemma":"of","ner":"O","tag":"P"}", "position": 88 }, { "token": "Venezuela", "start_offset": 503, "end_offset": 512, "type": "{"lemma":"venezuela","ner":"O","tag":"NR"}", "position": 89 }
also text: 想象一下,有那么一天,半夜三点,刚和兄弟喝完酒回到家,拿出钥匙捅进钥匙孔,可怎么也打不开. 你刚说了一句:咋打不开泥?我们的门就说话了: "轻一点嘛,人家好痛哦......"或是"我操哥们儿你怎么才回来啊?你媳妇儿把门锁了.你丫今儿晚上睡马路吧"。多好啊,世界大同了.
{ "token": "想象", "start_offset": 0, "end_offset": 2, "type": "{"lemma":"想象","ner":"O","tag":"VV"}", "position": 0 }, { "token": "一下", "start_offset": 2, "end_offset": 4, "type": "{"lemma":"一下","ner":"O","tag":"AD"}", "position": 1 }, { "token": "有", "start_offset": 5, "end_offset": 6, "type": "{"lemma":"有","ner":"O","tag":"VE"}", "position": 2 }, { "token": "那么", "start_offset": 6, "end_offset": 8, "type": "{"lemma":"那么","ner":"O","tag":"JJ"}", "position": 3 }, { "token": "一", "start_offset": 8, "end_offset": 9, "type": "{"lemma":"一","ner":"NUMBER","tag":"CD"}", "position": 4 }, { "token": "天", "start_offset": 9, "end_offset": 10, "type": "{"lemma":"天","ner":"MISC","tag":"M"}", "position": 5 }, { "token": "半夜", "start_offset": 11, "end_offset": 13, "type": "{"lemma":"半夜","ner":"TIME","tag":"NT"}", "position": 6 }, { "token": "三点", "start_offset": 13, "end_offset": 15, "type": "{"lemma":"三点","ner":"TIME","tag":"NT"}", "position": 7 }, { "token": "刚", "start_offset": 16, "end_offset": 17, "type": "{"lemma":"刚","ner":"O","tag":"AD"}", "position": 8 }, { "token": "和", "start_offset": 17, "end_offset": 18, "type": "{"lemma":"和","ner":"O","tag":"P"}", "position": 9 }, { "token": "兄弟", "start_offset": 18, "end_offset": 20, "type": "{"lemma":"兄弟","ner":"O","tag":"NN"}", "position": 10 }, { "token": "喝完", "start_offset": 20, "end_offset": 22, "type": "{"lemma":"喝完","ner":"O","tag":"VV"}", "position": 11 }, { "token": "酒", "start_offset": 22, "end_offset": 23, "type": "{"lemma":"酒","ner":"O","tag":"NN"}", "position": 12 }, { "token": "回到", "start_offset": 23, "end_offset": 25, "type": "{"lemma":"回到","ner":"O","tag":"VV"}", "position": 13 }, { "token": "家", "start_offset": 25, "end_offset": 26, "type": "{"lemma":"家","ner":"O","tag":"NN"}", "position": 14 }, { "token": "拿出", "start_offset": 27, "end_offset": 29, "type": "{"lemma":"拿出","ner":"O","tag":"VV"}", "position": 15 }, { "token": "钥匙", "start_offset": 29, "end_offset": 31, "type": "{"lemma":"钥匙","ner":"O","tag":"NN"}", 
"position": 16 }, { "token": "捅", "start_offset": 31, "end_offset": 32, "type": "{"lemma":"捅","ner":"O","tag":"VV"}", "position": 17 }, { "token": "进", "start_offset": 32, "end_offset": 33, "type": "{"lemma":"进","ner":"O","tag":"VV"}", "position": 18 }, { "token": "钥匙孔", "start_offset": 33, "end_offset": 36, "type": "{"lemma":"钥匙孔","ner":"O","tag":"NN"}", "position": 19 }, { "token": "可", "start_offset": 37, "end_offset": 38, "type": "{"lemma":"可","ner":"O","tag":"AD"}", "position": 20 }, { "token": "怎么", "start_offset": 38, "end_offset": 40, "type": "{"lemma":"怎么","ner":"O","tag":"AD"}", "position": 21 }, { "token": "也", "start_offset": 40, "end_offset": 41, "type": "{"lemma":"也","ner":"O","tag":"AD"}", "position": 22 }, { "token": "打", "start_offset": 41, "end_offset": 42, "type": "{"lemma":"打","ner":"O","tag":"VV"}", "position": 23 }, { "token": "不", "start_offset": 42, "end_offset": 43, "type": "{"lemma":"不","ner":"O","tag":"AD"}", "position": 24 }, { "token": "开", "start_offset": 43, "end_offset": 44, "type": "{"lemma":"开","ner":"O","tag":"VV"}", "position": 25 }, { "token": ".", "start_offset": 44, "end_offset": 45, "type": "{"lemma":".","ner":"O","tag":"PU"}", "position": 26 }, { "token": "你", "start_offset": 46, "end_offset": 47, "type": "{"lemma":"你","ner":"O","tag":"PN"}", "position": 27 }, { "token": "刚", "start_offset": 47, "end_offset": 48, "type": "{"lemma":"刚","ner":"O","tag":"AD"}", "position": 28 }, { "token": "说", "start_offset": 48, "end_offset": 49, "type": "{"lemma":"说","ner":"O","tag":"VV"}", "position": 29 }, { "token": "了", "start_offset": 49, "end_offset": 50, "type": "{"lemma":"了","ner":"O","tag":"AS"}", "position": 30 }, { "token": "一", "start_offset": 50, "end_offset": 51, "type": "{"lemma":"一","ner":"NUMBER","tag":"CD"}", "position": 31 }, { "token": "句", "start_offset": 51, "end_offset": 52, "type": "{"lemma":"句","ner":"O","tag":"M"}", "position": 32 }, { "token": ":", "start_offset": 52, "end_offset": 53, "type": 
"{"lemma":":","ner":"O","tag":"PU"}", "position": 33 }, { "token": "咋打", "start_offset": 53, "end_offset": 55, "type": "{"lemma":"咋打","ner":"O","tag":"NN"}", "position": 34 }, { "token": "不", "start_offset": 55, "end_offset": 56, "type": "{"lemma":"不","ner":"O","tag":"AD"}", "position": 35 }, { "token": "开泥", "start_offset": 56, "end_offset": 58, "type": "{"lemma":"开泥","ner":"O","tag":"VV"}", "position": 36 }, { "token": "我们", "start_offset": 59, "end_offset": 61, "type": "{"lemma":"我们","ner":"O","tag":"PN"}", "position": 37 }, { "token": "的", "start_offset": 61, "end_offset": 62, "type": "{"lemma":"的","ner":"O","tag":"DEG"}", "position": 38 }, { "token": "门", "start_offset": 62, "end_offset": 63, "type": "{"lemma":"门","ner":"O","tag":"NN"}", "position": 39 }, { "token": "就", "start_offset": 63, "end_offset": 64, "type": "{"lemma":"就","ner":"O","tag":"AD"}", "position": 40 }, { "token": "说话", "start_offset": 64, "end_offset": 66, "type": "{"lemma":"说话","ner":"O","tag":"VV"}", "position": 41 }, { "token": "了", "start_offset": 66, "end_offset": 67, "type": "{"lemma":"了","ner":"O","tag":"AS"}", "position": 42 }, { "token": ":", "start_offset": 67, "end_offset": 68, "type": "{"lemma":":","ner":"O","tag":"PU"}", "position": 43 }, { "token": "轻", "start_offset": 70, "end_offset": 71, "type": "{"lemma":"轻","ner":"O","tag":"JJ"}", "position": 44 }, { "token": "一点", "start_offset": 71, "end_offset": 73, "type": "{"lemma":"一点","ner":"NUMBER","tag":"CD"}", "position": 45 }, { "token": "嘛", "start_offset": 73, "end_offset": 74, "type": "{"lemma":"嘛","ner":"O","tag":"SP"}", "position": 46 }, { "token": "人家", "start_offset": 75, "end_offset": 77, "type": "{"lemma":"人家","ner":"O","tag":"NN"}", "position": 47 }, { "token": "好", "start_offset": 77, "end_offset": 78, "type": "{"lemma":"好","ner":"O","tag":"AD"}", "position": 48 }, { "token": "痛", "start_offset": 78, "end_offset": 79, "type": "{"lemma":"痛","ner":"O","tag":"VA"}", "position": 49 }, { "token": "哦", "start_offset": 79, 
"end_offset": 80, "type": "{"lemma":"哦","ner":"O","tag":"SP"}", "position": 50 }, { "token": ".", "start_offset": 80, "end_offset": 81, "type": "{"lemma":".","ner":"O","tag":"PU"}", "position": 51 }, { "token": ".....", "start_offset": 81, "end_offset": 86, "type": "{"lemma":".....","ner":"O","tag":"PU"}", "position": 52 }, { "token": "或是", "start_offset": 87, "end_offset": 89, "type": "{"lemma":"或是","ner":"O","tag":"CC"}", "position": 53 }, { "token": "我", "start_offset": 90, "end_offset": 91, "type": "{"lemma":"我","ner":"O","tag":"PN"}", "position": 54 }, { "token": "操", "start_offset": 91, "end_offset": 92, "type": "{"lemma":"操","ner":"O","tag":"VV"}", "position": 55 }, { "token": "哥们儿", "start_offset": 92, "end_offset": 95, "type": "{"lemma":"哥们儿","ner":"O","tag":"NN"}", "position": 56 }, { "token": "你", "start_offset": 95, "end_offset": 96, "type": "{"lemma":"你","ner":"O","tag":"PN"}", "position": 57 }, { "token": "怎么", "start_offset": 96, "end_offset": 98, "type": "{"lemma":"怎么","ner":"O","tag":"AD"}", "position": 58 }, { "token": "才", "start_offset": 98, "end_offset": 99, "type": "{"lemma":"才","ner":"O","tag":"AD"}", "position": 59 }, { "token": "回来", "start_offset": 99, "end_offset": 101, "type": "{"lemma":"回来","ner":"O","tag":"VV"}", "position": 60 }, { "token": "啊", "start_offset": 101, "end_offset": 102, "type": "{"lemma":"啊","ner":"O","tag":"SP"}", "position": 61 }, { "token": "你", "start_offset": 103, "end_offset": 104, "type": "{"lemma":"你","ner":"O","tag":"PN"}", "position": 62 }, { "token": "媳妇儿", "start_offset": 104, "end_offset": 107, "type": "{"lemma":"媳妇儿","ner":"O","tag":"AD"}", "position": 63 }, { "token": "把", "start_offset": 107, "end_offset": 108, "type": "{"lemma":"把","ner":"O","tag":"BA"}", "position": 64 }, { "token": "门锁", "start_offset": 108, "end_offset": 110, "type": "{"lemma":"门锁","ner":"O","tag":"NN"}", "position": 65 }, { "token": "了", "start_offset": 110, "end_offset": 111, "type": "{"lemma":"了","ner":"O","tag":"SP"}", 
"position": 66 }, { "token": ".", "start_offset": 111, "end_offset": 112, "type": "{"lemma":".","ner":"O","tag":"PU"}", "position": 67 }, { "token": "你", "start_offset": 112, "end_offset": 113, "type": "{"lemma":"你","ner":"O","tag":"PN"}", "position": 68 }, { "token": "丫今儿", "start_offset": 113, "end_offset": 116, "type": "{"lemma":"丫今儿","ner":"O","tag":"NN"}", "position": 69 }, { "token": "晚上", "start_offset": 116, "end_offset": 118, "type": "{"lemma":"晚上","ner":"TIME","tag":"NT"}", "position": 70 }, { "token": "睡", "start_offset": 118, "end_offset": 119, "type": "{"lemma":"睡","ner":"O","tag":"VV"}", "position": 71 }, { "token": "马路", "start_offset": 119, "end_offset": 121, "type": "{"lemma":"马路","ner":"O","tag":"NN"}", "position": 72 }, { "token": "吧", "start_offset": 121, "end_offset": 122, "type": "{"lemma":"吧","ner":"O","tag":"SP"}", "position": 73 }, { "token": "多", "start_offset": 124, "end_offset": 125, "type": "{"lemma":"多","ner":"O","tag":"AD"}", "position": 74 }, { "token": "好啊", "start_offset": 125, "end_offset": 127, "type": "{"lemma":"好啊","ner":"O","tag":"VA"}", "position": 75 }, { "token": "世界", "start_offset": 128, "end_offset": 130, "type": "{"lemma":"世界","ner":"O","tag":"NN"}", "position": 76 }, { "token": "大同", "start_offset": 130, "end_offset": 132, "type": "{"lemma":"大同","ner":"O","tag":"NR"}", "position": 77 }, { "token": "了", "start_offset": 132, "end_offset": 133, "type": "{"lemma":"了","ner":"O","tag":"SP"}", "position": 78 }, { "token": ".", "start_offset": 133, "end_offset": 134, "type": "{"lemma":".","ner":"O","tag":"PU"}", "position": 79 }
I guess it should be able to recognize General Mills but doesn't. Maybe it's not in the training data.
I also notice that your periods are being tokenized with the words that come before them, which definitely throws things off. Are you using the PTB tokenizer?
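To make the tokenization problem easier to spot in a long token dump, here is a minimal Python sketch. It is only a heuristic and not part of CoreNLP: the helper name `tokens_with_attached_period` and the regular expression are assumptions made for illustration. It flags tokens that are an all-lowercase word or a plain number with a single trailing period, such as "sampling." or "23.".

```python
import re

# Hypothetical heuristic (not a CoreNLP API): an all-lowercase word or a
# plain run of digits, followed by exactly one trailing period.
_SUSPECT = re.compile(r"^(?:[a-z]+|\d+)\.$")


def tokens_with_attached_period(tokens):
    """Return the tokens that look like a word or number with a period still attached."""
    return [t for t in tokens if _SUSPECT.match(t)]


# A few tokens copied from the output posted earlier in this thread.
sample = ["Jan.", "23.", "of", "sampling.", "The", "U.S.", "R-S.C.", "."]
print(tokens_with_attached_period(sample))
```

Capitalized abbreviations such as "Jan." and "U.S." are deliberately left alone, but lowercase abbreviations such as "etc." would be flagged as false positives, so treat the result as a hint rather than a verdict.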
Hmm, actually I just pulled in stanford-core-nlp 3.9.2 together with the same version of the models (and the Chinese models, of course), so I'm not quite sure which tokenizer the project picked.
Some sample code:
`Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner");
props.setProperty("tokenize.language", "zh");
props.setProperty("ssplit.boundaryTokenRegex", "[.。]|[!?!?]+");
props.setProperty("segment.model", "edu/stanford/nlp/models/segmenter/chinese/ctb.gz");
props.setProperty("segment.sighanCorporaDict", "edu/stanford/nlp/models/segmenter/chinese");
props.setProperty("segment.serDictionary", "edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz");
props.setProperty("segment.sighanPostProcessing", "true");
props.setProperty("pos.model", "edu/stanford/nlp/models/pos-tagger/chinese-distsim/chinese-distsim.tagger");
props.setProperty("parse.model", "edu/stanford/nlp/models/srparser/chineseSR.ser.gz");
props.setProperty("depparse.model", "edu/stanford/nlp/models/parser/nndep/UD_Chinese.gz");
props.setProperty("depparse.language", "chinese");
props.setProperty("ner.language", "chinese");
props.setProperty("ner.model", "edu/stanford/nlp/models/ner/chinese.misc.distsim.crf.ser.gz");
props.setProperty("ner.applyNumericClassifiers", "true");
props.setProperty("ner.useSUTime", "false");
props.setProperty("regexner.mapping", "edu/stanford/nlp/models/kbp/cn_regexner_mapping.tab");
props.setProperty("regexner.validpospattern", "^(NR|NN|JJ).*");
props.setProperty("regexner.ignorecase", "true");
props.setProperty("regexner.noDefaultOverwriteLabels", "CITY");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);`
And the functional part:
`CoreDocument document = new CoreDocument(fullStr);
pipeline.annotate(document);
document.sentences().stream()
    .map(sentence -> sentence.tokens().stream()
        .filter(token -> {
            String word = token.get(CoreAnnotations.TextAnnotation.class);
            return !(isBlank(word) || ignoreSymbols.contains(word));
        })
        .collect(Collectors.toList()))
    .reduce(words, (list, item) -> {
        list.addAll(item);
        return list;
    });`
Then I just iterated over the list and produced the JSON result shown above.
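As a side note, the collection step could probably be written with `flatMap` instead of a `reduce` that mutates its identity list. This is only a minimal sketch: the class name `TokenFlatten` and the `flatten` helper are hypothetical, and plain strings stand in for the CoreNLP token objects so that the snippet runs without the library on the classpath.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class TokenFlatten {
    // Hypothetical helper: each inner list is one sentence, and each string
    // stands in for the text of one token (an assumption for illustration).
    static List<String> flatten(List<List<String>> sentences, Set<String> ignoreSymbols) {
        return sentences.stream()
                // Merge the per-sentence token lists into a single stream.
                .flatMap(List::stream)
                // Drop blank tokens and tokens listed in ignoreSymbols.
                .filter(word -> !(word.trim().isEmpty() || ignoreSymbols.contains(word)))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<List<String>> sentences = List.of(
                List.of("Hello", "World", "."),
                List.of("I", "have", "$1.6", "billion", "."));
        System.out.println(flatten(sentences, Set.of(".")));
    }
}
```

The filtering condition mirrors the one in the snippet above; only the way the per-sentence lists are merged differs, so the NER labels themselves should not be affected by this change.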