elasticsearch-analysis-dynamic-synonym

Repeated requests that use synonyms have a high probability of returning no data

Open xiaoheike opened this issue 8 years ago • 11 comments

Problem description

  • Index configuration
{
   "test": {
      "aliases": {},
      "mappings": {
         "test": {
            "properties": {
               "text_1": {
                  "type": "string",
                  "analyzer": "synonym"
               }
            }
         }
      },
      "settings": {
         "index": {
            "creation_date": "1482891562524",
            "analysis": {
               "filter": {
                  "remote_synonym": {
                     "type": "dynamic_synonym",
                     "synonyms_path": "http://IP:PORT/waf_file/files/sw",
                     "interval": "30"
                  }
               },
               "analyzer": {
                  "synonym": {
                     "filter": [
                        "remote_synonym"
                     ],
                     "tokenizer": "ik"
                  }
               }
            },
            "number_of_shards": "5",
            "number_of_replicas": "1",
            "uuid": "NMZ4fUryRXyoZ057lQrhDA",
            "version": {
               "created": "2030299"
            }
         }
      },
      "warmers": {}
   }
}
  • Create a document
PUT /test/test/1?pretty=1
{
   "text_1" : "水的密度很大"
}
  • Run the following query several times
GET /test/_search
{
    "query": {
        "query_string": {
           "default_field": "text_1",
           "analyzer": "synonym", 
           "query": "density"
        }
    }
}
  • Add a new synonym entry to the synonym file: 密度, density
  • Run the query again
GET /test/_search
{
    "query": {
        "query_string": {
           "default_field": "text_1",
           "analyzer": "synonym", 
           "query": "density"
        }
    }
}
  • The document is found
{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 0.16609077,
      "hits": [
         {
            "_index": "test",
            "_type": "test",
            "_id": "15",
            "_score": 0.16609077,
            "_source": {
               "text_1": "水的密度很大"
            }
         }
      ]
   }
}
  • When the same request is repeated, there is a high probability that the document is not retrieved
{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 0,
      "max_score": null,
      "hits": []
   }
}

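What I have tried

  • shard = 1, replica = 1: the problem does not occur
  • shard = 5, replica = 1, with two ES instances on a single machine forming a cluster: the problem persists
  • After restarting ES, the problem does not occur

I do not know where the problem lies and would appreciate your help.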

xiaoheike avatar Dec 30 '16 07:12 xiaoheike

@xiaoheike, first make sure that the plugin is installed on every node in the cluster.

bells avatar Dec 31 '16 07:12 bells

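@bells Sorry, I did not describe the problem clearly. To add some detail: the problem already occurs on a single machine (only one ES instance, shard = 5, replica = 1). So I do not think it is caused by other machines in the cluster lacking the plugin. Do you have any other suggestions?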

xiaoheike avatar Dec 31 '16 08:12 xiaoheike

On a single machine, replica = 1 does not seem meaningful. Does the same thing happen with replica = 0?

dcais avatar Jan 04 '17 06:01 dcais

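@davidcai19840412 The problem also occurs with replica = 0. I have been stuck on this all week and have run every experiment I could think of, but each one failed. Have you run into this? Could it be that I am using the plugin incorrectly?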

xiaoheike avatar Jan 13 '17 04:01 xiaoheike

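Found the cause: DynamicSynonymTokenFilterFactory.create() is called concurrently, and the field DynamicSynonymTokenFilterFactory.dynamicSynonymFilters does not support concurrent insertion. As a result, some DynamicSynonymFilter objects are never stored in dynamicSynonymFilters, so run() does not update them when the synonyms are reloaded. The fix changes two places, the field declaration and the run() method:

Before: private Map<DynamicSynonymFilter, Integer> dynamicSynonymFilters = new WeakHashMap();
After:  private List<DynamicSynonymFilter> dynamicSynonymFilters = Collections.synchronizedList(new ArrayList<DynamicSynonymFilter>());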


public void run() {
	if (synonymFile.isNeedReloadSynonymMap()) {
		synonymMap = synonymFile.reloadSynonymMap();
		// Iterating over a Collections.synchronizedList requires holding its lock manually.
		synchronized (dynamicSynonymFilters) {
			for (DynamicSynonymFilter dynamicSynonymFilter : dynamicSynonymFilters) {
				dynamicSynonymFilter.update(synonymMap);
				logger.info("{} success reload synonym", indexName);
			}
		}
	}
}
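
The idea behind this change can be tried in isolation. The following is a minimal sketch, not the plugin's actual code: StubFilter and Registry are hypothetical stand-ins for DynamicSynonymFilter and for the factory's bookkeeping, and an integer version number stands in for the SynonymMap. It registers filters from several threads at once through a synchronized list, then reloads and checks that no filter was missed.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class FilterRegistrySketch {

    /** Hypothetical stand-in for DynamicSynonymFilter: remembers the synonym version it last saw. */
    static final class StubFilter {
        private volatile int synonymVersion = 0;
        void update(int newVersion) { synonymVersion = newVersion; }
        int version() { return synonymVersion; }
    }

    /** Hypothetical stand-in for the factory's list of created filters. */
    static final class Registry {
        private final List<StubFilter> filters =
                Collections.synchronizedList(new ArrayList<StubFilter>());

        /** May be called from many threads; add() is synchronized by the wrapper. */
        StubFilter create() {
            StubFilter filter = new StubFilter();
            filters.add(filter);
            return filter;
        }

        /** Pushes the new version to every registered filter. */
        void reload(int newVersion) {
            synchronized (filters) { // iteration must hold the list's own lock
                for (StubFilter filter : filters) {
                    filter.update(newVersion);
                }
            }
        }

        int size() { return filters.size(); }

        /** Counts filters that did not receive the given version. */
        int staleCount(int currentVersion) {
            int stale = 0;
            synchronized (filters) {
                for (StubFilter filter : filters) {
                    if (filter.version() != currentVersion) stale++;
                }
            }
            return stale;
        }
    }

    /** Calls create() from several threads at once and returns the populated registry. */
    static Registry registerConcurrently(int threads, int perThread) {
        Registry registry = new Registry();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch start = new CountDownLatch(1);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                try {
                    start.await(); // release all workers together to provoke contention
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                for (int i = 0; i < perThread; i++) registry.create();
            });
        }
        start.countDown();
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return registry;
    }

    public static void main(String[] args) {
        Registry registry = registerConcurrently(8, 1000);
        registry.reload(1);
        System.out.println("registered=" + registry.size() + " stale=" + registry.staleCount(1));
    }
}
```

With the synchronized list, all 8 x 1000 filters are registered and none is stale after reload. One trade-off to keep in mind: unlike a WeakHashMap, a plain list holds strong references, so filters that are no longer in use stay in the list unless they are removed explicitly.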

I also tried the following change: private Map<DynamicSynonymFilter, Integer> dynamicSynonymFilters = new WeakHashMap() --> private Map<DynamicSynonymFilter, Integer> dynamicSynonymFilters = new ConcurrentHashMap<>(); but objects were still lost when create() was called. I did not investigate the exact reason. @bells, could you please verify this change?

xiaoheike avatar Jan 19 '17 12:01 xiaoheike

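I ran into the same problem: the same query returns noticeably different record counts and different total values from one request to the next.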

lzg406 avatar Mar 31 '17 13:03 lzg406

I am having the same issue. Before I changed synonym.txt, a given search returned N results. After the change (with the query updated accordingly and after waiting for the synonym refresh), the same search returns inconsistent responses: no hits, some of the expected hits, or all of the expected hits.

UPDATE: I see this is fixed in the new version. I am using an older version for Elasticsearch 5.1.1, so I took the fix from @xiaoheike's pull request. Thanks!

dima-goldin avatar Jun 20 '17 15:06 dima-goldin

@xiaoheike @bells Has this problem been fixed in master?

xinlmain avatar Dec 13 '18 09:12 xinlmain

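It should have been fixed. Others have asked about this before as well. Try modifying the code based on my branch or on the fix I described on this page. @xinlmain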

xiaoheike avatar Dec 13 '18 09:12 xiaoheike

Why do I sometimes get different tokenization results when I send the same synonym request multiple times?

wang690698686 avatar Mar 28 '19 12:03 wang690698686

Here "三次方" is a user-defined word. Occasionally I get this result:
{
   "tokens": [
      {
         "token": "三",
         "start_offset": 0,
         "end_offset": 1,
         "type": "en",
         "position": 0
      },
      {
         "token": "次方",
         "start_offset": 1,
         "end_offset": 9,
         "type": "m",
         "position": 1
      }
   ]
}
The result I want is:
{
   "tokens": [
      {
         "token": "三次方",
         "start_offset": 0,
         "end_offset": 9,
         "type": "userDefine",
         "position": 0
      }
   ]
}

wang690698686 avatar Mar 28 '19 12:03 wang690698686