analysis-pinyin

Pinyin (romanized) personal names are tokenized inaccurately

Open Radiums opened this issue 6 years ago • 5 comments

Analyze request: { "tokenizer":"pinyin", "text":"xuning" }

Actual result: { "tokens": [ { "token": "xun", "start_offset": 0, "end_offset": 5, "type": "word", "position": 0 }, { "token": "ing", "start_offset": 0, "end_offset": 5, "type": "word", "position": 1 }, { "token": "xuning", "start_offset": 0, "end_offset": 6, "type": "word", "position": 2 } ] }

Expected result: { "tokens": [ { "token": "xu", "start_offset": 0, "end_offset": 5, "type": "word", "position": 0 }, { "token": "ning", "start_offset": 0, "end_offset": 5, "type": "word", "position": 1 }, { "token": "xuning", "start_offset": 0, "end_offset": 6, "type": "word", "position": 2 } ] }

Radiums, Aug 13 '17 04:08

Without any context, this kind of ambiguity is hard to resolve.

medcl, Jan 15 '18 10:01

It is simply a person's name: 徐宁 (Xu Ning).

Radiums, Jan 16 '18 03:01

@medcl I ran into the same problem. For input like xi'an, I want it analyzed into xi, an, and xian, rather than only xian. Is there a way to handle this class of cases?

fortunatelx, Jan 24 '18 06:01

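The only fix is to add entries to the dictionary. The dictionary cannot currently be extended separately, so I need to change that.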

medcl, Jan 25 '18 06:01

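Judging from the current behavior, tokenization appears to use longest match. Could a future version offer a mode that segments by matching the largest number of pinyin syllables instead?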

Radiums, Jan 25 '18 08:01
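
To make the two strategies discussed in this thread concrete, here is a minimal Python sketch comparing forward longest match with a "most syllables" segmentation. It is not the plugin's actual code: the syllable table is a tiny hypothetical subset, and the function names and the handling of unmatched characters are assumptions chosen only to mirror the behavior reported above.

```python
# Hypothetical toy syllable table; a real pinyin table has several hundred entries.
SYLLABLES = {"xu", "xun", "ning", "xi", "an", "xian"}
MAX_LEN = max(len(s) for s in SYLLABLES)


def greedy_longest_match(text):
    """Forward longest match: take the longest syllable at each position.

    Characters that start no known syllable are collected into one
    leftover token (an assumption made for this sketch).
    """
    tokens, unmatched, i = [], "", 0
    while i < len(text):
        for size in range(min(MAX_LEN, len(text) - i), 0, -1):
            if text[i:i + size] in SYLLABLES:
                if unmatched:
                    tokens.append(unmatched)
                    unmatched = ""
                tokens.append(text[i:i + size])
                i += size
                break
        else:
            # No syllable starts here; keep the character as leftover.
            unmatched += text[i]
            i += 1
    if unmatched:
        tokens.append(unmatched)
    return tokens


def all_segmentations(text):
    """Return every split of text that consists only of known syllables."""
    if not text:
        return [[]]
    results = []
    for size in range(1, min(MAX_LEN, len(text)) + 1):
        head = text[:size]
        if head in SYLLABLES:
            for rest in all_segmentations(text[size:]):
                results.append([head] + rest)
    return results


def most_syllables(text):
    """Prefer the full segmentation with the most syllables, else fall back to greedy."""
    candidates = all_segmentations(text)
    return max(candidates, key=len) if candidates else greedy_longest_match(text)


print(greedy_longest_match("xuning"))  # ['xun', 'ing']
print(most_syllables("xuning"))        # ['xu', 'ning']
print(most_syllables("xian"))          # ['xi', 'an']
```

With this toy table, greedy matching consumes "xun" first and strands "ing", while the exhaustive search only accepts splits made entirely of syllables and so recovers "xu" + "ning". Truly ambiguous inputs such as "xian" still have more than one valid split, which is why a dictionary or surrounding context is needed to choose among them.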