es-language-char-filter
An Elasticsearch char filter for dividing multi-language text into different fields.
Introduction
Some Elasticsearch analyzers are language-specific, such as the english analyzer. They tokenize documents into terms according to the rules of that language.
However, these analyzers do not skip text written in other languages. For multi-language documents, the recommended solution is to create sub-fields, each using the analyzer that matches its language. The foreign-language text left in each field still reduces the accuracy and efficiency of search.
For instance,
POST _analyze
{
  "analyzer": "english",
  "text": "We are going to meet at 中山路."
}
The generated terms are,
[ "we", "go", "meet", "中", "山", "路" ]
"中山路" is actually a road name, but it is split into individual characters.
If we switch to a Chinese analyzer such as IK, it tokenizes the Chinese characters correctly but keeps the English words,
POST _analyze
{
  "analyzer": "ik_smart",
  "text": "We are going to meet at 中山路."
}
The generated terms are,
["we", "going", "meet", "中山路"]
This time, the Chinese characters are tokenized correctly, but the English words are kept as well.
Both cases lead to search problems because the matching score covers all the fields. This char filter aims to filter text by language so that each field stores terms from only one language.
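The idea can be illustrated with a small Python sketch. This is not the plugin's implementation: the function name keep_language is hypothetical, it assumes that "filtering" means keeping only the characters of the selected language, and it approximates Chinese with the basic CJK Unified Ideographs block and English with ASCII letters. Removed characters are replaced by spaces so that the text length stays the same.

```python
def is_cjk(ch: str) -> bool:
    # Simplification: only the basic CJK Unified Ideographs block.
    return "\u4e00" <= ch <= "\u9fff"


def is_latin(ch: str) -> bool:
    # Simplification: ASCII letters only.
    return ch.isascii() and ch.isalpha()


def keep_language(text: str, lang: str) -> str:
    """Blank out every character that does not belong to `lang`.

    Assumption: `lang` names the language to keep. Whitespace is
    preserved and other characters become spaces, so the output has
    the same length as the input.
    """
    if lang == "zh-CN":
        keep = is_cjk
    elif lang == "EN":
        keep = is_latin
    else:
        raise ValueError(f"unsupported lang: {lang!r}")
    return "".join(ch if keep(ch) or ch.isspace() else " " for ch in text)


if __name__ == "__main__":
    text = "We are going to meet at 中山路."
    print(keep_language(text, "EN").split())     # ['We', 'are', 'going', 'to', 'meet', 'at']
    print(keep_language(text, "zh-CN").split())  # ['中山路']
```

After such a step, the english analyzer would see only the English words and a Chinese analyzer would see only "中山路".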
Usage
Install
- Download the released zip file from: https://github.com/stormisover/es-language-char-filter/releases/download/0.1/language-char-filter-0.1.zip
- Unzip it to elasticsearch/plugins/language-char-filter
- Restart Elasticsearch
Definition
Define your char filter
"char_filter": {
  "language_char_filter": {
    "type": "language_char_filter",
    "lang": "EN"
  }
}
The lang parameter specifies which language should be filtered. The valid values are,
- zh-CN
- EN
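To show where the char filter definition sits in a full index configuration, here is a minimal Python sketch that builds the settings body. The helper name build_index_settings, the analyzer name single_language, and the choice of the standard tokenizer are assumptions made for illustration. Only the language_char_filter type and its lang values come from this README.

```python
import json

VALID_LANGS = ("zh-CN", "EN")


def build_index_settings(lang: str) -> dict:
    """Return an index-creation body that wires the char filter into
    a custom analyzer (analyzer name and tokenizer are hypothetical)."""
    if lang not in VALID_LANGS:
        raise ValueError(f"lang must be one of {VALID_LANGS}, got {lang!r}")
    return {
        "settings": {
            "analysis": {
                "char_filter": {
                    "language_char_filter": {
                        "type": "language_char_filter",
                        "lang": lang,
                    }
                },
                "analyzer": {
                    # Hypothetical analyzer that applies the char filter
                    # before tokenizing.
                    "single_language": {
                        "type": "custom",
                        "char_filter": ["language_char_filter"],
                        "tokenizer": "standard",
                    }
                },
            }
        }
    }


if __name__ == "__main__":
    # Print the JSON body, e.g. for use in a PUT <index> request.
    print(json.dumps(build_index_settings("EN"), indent=2))
```

Running the _analyze API against the resulting analyzer with a mixed-language sentence is a quick way to check what a given lang value does.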