elasticsearch-sudachi
Synonym expansion not working (Elasticsearch v8 + sudachi_split)
Summary
In an Elasticsearch v8 environment, the synonym expansion is not functioning when using sudachi_split and synonym filters together.
Steps to Reproduce
- Set up an Elasticsearch v8 environment
- Configure an index to use both sudachi_split and synonym filters
- Index documents into the index
- Execute a search query containing synonyms
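The steps above can be sketched as a sequence of REST requests. This is only an illustration: the field name `content`, the indexed text, and the choice of index-time versus search-time analyzer are assumptions that the report does not state. The index name `test_sudachi` and the analyzer names come from the Configuration and Analysis Results sections, and the index is assumed to have already been created with those settings. The script only builds and prints the request bodies and does not contact a cluster.

```python
import json

INDEX = "test_sudachi"  # index name used in the _analyze calls in this report
FIELD = "content"       # hypothetical field name, not given in the report


def build_requests(index=INDEX, field=FIELD):
    """Return (method, path, body) tuples for the reproduction steps."""
    # Assumed mapping: split-only analyzer at index time,
    # synonym + split analyzer at search time.
    mapping = {
        "properties": {
            field: {
                "type": "text",
                "analyzer": "sudachi_search_analyzer",
                "search_analyzer": "sudachi_synonym_search_analyzer",
            }
        }
    }
    # Assumed document: contains only the synonym form.
    document = {field: "関空"}
    # Query with the other side of the synonym rule.
    query = {"query": {"match": {field: "関西国際空港"}}}
    return [
        ("PUT", f"/{index}/_mapping", mapping),
        ("POST", f"/{index}/_doc?refresh=true", document),
        ("GET", f"/{index}/_search", query),
    ]


if __name__ == "__main__":
    for method, path, body in build_requests():
        print(method, path)
        print(json.dumps(body, ensure_ascii=False, indent=2))
```

If synonym expansion worked at search time, the final query would be expected to match the document containing 関空. With the behavior described in this report, it would return no hits.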
Expected Behavior
The synonym filter should expand synonyms, and documents containing the synonyms should be returned as hits.
Actual Behavior
Synonym expansion does not occur, and documents containing synonyms are not returned as hits.
Related Information
- In Elasticsearch v7, the sample configuration provided in the documentation worked for synonym expansion
- The documentation was last updated 4 years ago (Elasticsearch v7), and the behavior may have changed in subsequent updates
Environment
- OS: macOS 13.4.1 (arm64)
- Docker version: 26.0.0
- Elasticsearch version: 8.8.1
- elasticsearch-sudachi version: 3.1.0
$ sw_vers
ProductName: macOS
ProductVersion: 13.4.1
BuildVersion: 22F82
$ uname -m
arm64
$ hostinfo
Mach kernel version:
Darwin Kernel Version 22.5.0: Thu Jun 8 22:22:19 PDT 2023; root:xnu-8796.121.3~7/RELEASE_ARM64_T8103
Kernel configured for up to 8 processors.
8 processors are physically available.
8 processors are logically available.
Processor type: arm64e (ARM64E)
Processors active: 0 1 2 3 4 5 6 7
Primary memory available: 8.00 gigabytes
Default processor set: 419 tasks, 3980 threads, 8 processors
Load average: 2.02, Mach factor: 6.09
$ docker -v
Docker version 26.0.0, build 2ae903e
$ curl -X GET 'http://localhost:9200/'
{
"name" : "5edac9bc174f",
"cluster_name" : "docker-cluster",
"cluster_uuid" : "rtQ7kzApQ-OSQQ86bnYkPg",
"version" : {
"number" : "8.8.1",
"build_flavor" : "default",
"build_type" : "docker",
"build_hash" : "f8edfccba429b6477927a7c1ce1bc6729521305e",
"build_date" : "2023-06-05T21:32:25.188464208Z",
"build_snapshot" : false,
"lucene_version" : "9.6.0",
"minimum_wire_compatibility_version" : "7.17.0",
"minimum_index_compatibility_version" : "7.0.0"
},
"tagline" : "You Know, for Search"
}
$ elasticsearch-plugin install https://github.com/WorksApplications/elasticsearch-sudachi/releases/download/v3.1.0/elasticsearch-8.8.1-analysis-sudachi-3.1.0.zip
Configuration
Index settings:
{
"settings": {
"index": {
"number_of_replicas": "0",
"analysis": {
"filter": {
"search": {
"type": "sudachi_split",
"mode": "search"
},
"synonym": {
"type": "synonym",
"synonyms": ["関西国際空港,関空", "関西 => 近畿"]
}
},
"tokenizer": {
"sudachi_c_tokenizer": {
"type": "sudachi_tokenizer",
"additional_settings": "{\"systemDict\":\"system_core.dic\"}",
"discard_punctuation": "true",
"split_mode": "C"
}
},
"analyzer": {
"sudachi_search_analyzer": {
"type": "custom",
"char_filter": [],
"tokenizer": "sudachi_c_tokenizer",
"filter": ["search"]
},
"sudachi_synonym_analyzer": {
"type": "custom",
"char_filter": [],
"tokenizer": "sudachi_c_tokenizer",
"filter": ["synonym"]
},
"sudachi_synonym_search_analyzer": {
"type": "custom",
"char_filter": [],
"tokenizer": "sudachi_c_tokenizer",
"filter": ["synonym", "search"]
}
}
}
}
}
}
Analysis Results
- With sudachi_split only:
$ curl -X GET "localhost:9200/test_sudachi/_analyze?pretty" -H 'Content-Type: application/json' -d'{"analyzer":"sudachi_search_analyzer", "text" : "関西国際空港"}'
{
"tokens" : [
{ "token" : "関西国際空港", "start_offset" : 0, "end_offset" : 6, "type" : "word", "position" : 0, "positionLength" : 3 },
{ "token" : "関西", "start_offset" : 0, "end_offset" : 2, "type" : "word", "position" : 0 },
{ "token" : "国際", "start_offset" : 2, "end_offset" : 4, "type" : "word", "position" : 1 },
{ "token" : "空港", "start_offset" : 4, "end_offset" : 6, "type" : "word", "position" : 2 }
]
}
- With synonym filter only:
$ curl -X GET "localhost:9200/test_sudachi/_analyze?pretty" -H 'Content-Type: application/json' -d'{"analyzer":"sudachi_synonym_analyzer", "text" : "関西国際空港"}'
{
"tokens" : [
{ "token" : "関西国際空港", "start_offset" : 0, "end_offset" : 6, "type" : "word", "position" : 0 },
{ "token" : "関空", "start_offset" : 0, "end_offset" : 6, "type" : "SYNONYM", "position" : 0 }
]
}
- With both sudachi_split and synonym filter:
$ curl -X GET "localhost:9200/test_sudachi/_analyze?pretty" -H 'Content-Type: application/json' -d'{"analyzer":"sudachi_synonym_search_analyzer", "text" : "関西国際空港"}'
{
"tokens" : [
{ "token" : "関西国際空港", "start_offset" : 0, "end_offset" : 6, "type" : "word", "position" : 0, "positionLength" : 3 },
{ "token" : "関西", "start_offset" : 0, "end_offset" : 2, "type" : "word", "position" : 0 },
{ "token" : "国際", "start_offset" : 2, "end_offset" : 4, "type" : "word", "position" : 1 },
{ "token" : "空港", "start_offset" : 4, "end_offset" : 6, "type" : "word", "position" : 2 }
]
}
The synonym expansion (関空) is expected but not occurring.
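To make the difference between the outputs easier to check mechanically, here is a minimal sketch that lists the tokens whose type is SYNONYM in an `_analyze` response. The helper name `synonym_terms` is hypothetical. The two sample responses are copied from the outputs above for the synonym-only analyzer and the combined analyzer.

```python
def synonym_terms(analyze_response):
    """Return the terms of all tokens emitted with type SYNONYM."""
    return [
        token["token"]
        for token in analyze_response.get("tokens", [])
        if token.get("type") == "SYNONYM"
    ]


# Response from sudachi_synonym_analyzer (synonym filter only).
synonym_only = {"tokens": [
    {"token": "関西国際空港", "start_offset": 0, "end_offset": 6, "type": "word", "position": 0},
    {"token": "関空", "start_offset": 0, "end_offset": 6, "type": "SYNONYM", "position": 0},
]}

# Response from sudachi_synonym_search_analyzer (synonym + sudachi_split).
both = {"tokens": [
    {"token": "関西国際空港", "start_offset": 0, "end_offset": 6, "type": "word", "position": 0, "positionLength": 3},
    {"token": "関西", "start_offset": 0, "end_offset": 2, "type": "word", "position": 0},
    {"token": "国際", "start_offset": 2, "end_offset": 4, "type": "word", "position": 1},
    {"token": "空港", "start_offset": 4, "end_offset": 6, "type": "word", "position": 2},
]}

print(synonym_terms(synonym_only))  # ['関空']
print(synonym_terms(both))          # []
```

The empty list for the combined analyzer corresponds to the missing 関空 token reported here.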
Questions
- Is there a way to make synonym expansion work when using sudachi_split and synonym filters together in an Elasticsearch v8 environment?
- Are there any reported issues or documents describing a similar problem?
- Have any workarounds or alternative configuration methods been found for this issue?
Any help or guidance would be greatly appreciated. Thank you in advance.