elasticsearch-russian-phonetics
elasticsearch-russian-phonetics copied to clipboard
Elasticsearch plugin for Russian Phonetic Analysis
= Elasticsearch plugin for Russian Phonetic Analysis Nikolay Papakha ifdef::env-github[] :tip-caption: :bulb: :note-caption: :paperclip: :important-caption: :heavy_exclamation_mark: :caution-caption: :fire: :warning-caption: :warning: endif::[] ifndef::env-github[] endif::[]
image:https://travis-ci.org/papahigh/elasticsearch-russian-phonetics.svg?branch=master["Build Status", link="https://travis-ci.org/papahigh/elasticsearch-russian-phonetics"] image:https://codecov.io/gh/papahigh/elasticsearch-russian-phonetics/branch/master/graph/badge.svg["Code Coverage for encoder project", link="https://codecov.io/gh/papahigh/elasticsearch-russian-phonetics"] image:https://img.shields.io/badge/License-Apache%202.0-blue.svg[link=https://opensource.org/licenses/Apache-2.0]
:url-throughput-benchmark: https://github.com/papahigh/elasticsearch-russian-phonetics/blob/master/benchmark/throughput.asciidoc :url-distribution-benchmark: https://github.com/papahigh/elasticsearch-russian-phonetics/blob/master/benchmark/distribution.asciidoc :url-misspellings-benchmark: https://github.com/papahigh/elasticsearch-russian-phonetics/blob/master/benchmark/misspellings_and_typos.asciidoc :url-unit-tests: https://github.com/papahigh/elasticsearch-russian-phonetics/tree/master/encoder/src/test/java/com/github/papahigh/phonetic/encoder :url-encoding-rules: https://github.com/papahigh/elasticsearch-russian-phonetics/blob/master/encoder/README.asciidoc :url-releases-page: https://github.com/papahigh/elasticsearch-russian-phonetics/blob/master/releases.asciidoc :url-issue-tracker: https://github.com/papahigh/elasticsearch-russian-phonetics/issues :url-pull-request: https://github.com/papahigh/elasticsearch-russian-phonetics/pulls :url-encoder-project: https://github.com/papahigh/elasticsearch-russian-phonetics/tree/master/encoder :url-esplugin-project: https://github.com/papahigh/elasticsearch-russian-phonetics/tree/master/esplugin
This plugin provides phonetic analysis of Russian language by exposing *russian_phonetic*
token filter
which transforms russian words to their phonetic representation or so-called phonetic code. These codes are used
for matching words and names which sound similar. The process of transformation is also known as phonetic encoding
and this plugin is able to encode millions of russian words per second with the lowest impact on GC among all encoders
compared in encoding throughput benchmarks.
NOTE: Results for {url-misspellings-benchmark}[matching misspellings and typos], {url-distribution-benchmark}[distribution] and {url-throughput-benchmark}[encoding throughput] benchmarks.
Encoding algorithm extensively employs phonetic and orthographic rules in order to fill the inconsistency gap between spelling and pronunciation in Russian Language.
[source,intent=0] .Examples of spelling and pronunciation inconsistency
вдры[зг] ⟷ вдры[ск] слове[тск]ий ⟷ славе[цк]ий ла[ндш]афт ⟷ ла[нш]афт п[я]так ⟷ п[и]так бу[хг]алтер ⟷ бу[г]алтер бю[стг]алтер ⟷ бю[зд]галтер ле[стн]ица ⟷ ле[сн]ица кислово[дск] ⟷ кислово[цк]
You can find more information about encoding process at the {url-encoding-rules}[encoding rules] and {url-unit-tests}[unit tests].
== Installation
In order to install the plugin, {url-releases-page}[choose a version] and run:
[source,sh]
$ bin/elasticsearch-plugin install URL
where *URL*
points to zip file of the appropriate release which corresponds to your elasticsearch version.
IMPORTANT: The plugin must be installed on every node in the cluster, and each node must be restarted after installation.
E.g., command for Elasticsearch 7.6.2
[source,sh,options="wrap"]
install plugin on Elasticsearch 7.6.2
$ bin/elasticsearch-plugin install https://github.com/papahigh/elasticsearch-russian-phonetics/raw/7.6.2/esplugin/plugin-distributions/analysis-russian-phonetic-7.6.2.zip
After installation plugin exposes new token filter named *russian_phonetic*
.
== Getting started
You can start using the *russian_phonetic*
token filter by providing analysis configuration:
[source,javascript]
PUT /russian_phonetic_sample { "settings": { "analysis": { "analyzer": { "my_analyzer": { "tokenizer": "standard", "filter": [ "standard", "russian_phonetic" ] } }, "filter": { "russian_phonetic": { "type": "russian_phonetic", "replace": false } } } } }
Then you should be able to hit the analyzer with *russian_phonetic*
token filter using the https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html[analyze API]
[source,javascript]
POST /russian_phonetic_sample/_analyze { "analyzer": "my_analyzer", "text": "студентка комсомолка спортсменка" }
Returns: *стднк*
, студентка
, *кмсмлк*
, комсомолка
, *спрцмнк*
, спортсменка
[[token-filter-settings]] == Token filter settings
The *russian_phonetic*
token filter provides a bunch of configuration options to meet your particular needs:
replace::
Whether or not the original token should be replaced by the phonetic code. Accepts *true*
(default) or *false*
.
vowels::
Defines encoding mode for vowels. Accepts *encode_first*
(default) or *encode_all*
.
+
[source,intent=0]
.encode_first: only first vowel in the supplied word will be encoded
упячка → упчк голландский → глнскй абсурд → апсрт
[source,intent=0] .encode_all: all vowels will be encoded according to the link:{url-encoding-rules}[encoding rules]
упячка → уп2чк1 голландский → г1л1нск2й абсурд → апс3рт
max_code_len::
The maximum length of the phonetic code. Defaults to *8*
.
enable_stemmer::
Whether or not the link:http://snowball.tartarus.org/algorithms/russian/stemmer.html[stemming] should be applied. Accepts *true*
or *false*
(default).
When this option is enabled only base (or root) form of the supplied word will be encoded.
+
[source,intent=0]
аннотируешь → антрш аннотируешься → антрш аннотируешь → ан1т2р32ш аннотируешься → ан1т2р32ш ящурным → ящрн ящурные → ящрн ящурным → ящ3рн ящурные → ящ3рн
TIP: Please take a look at the {url-throughput-benchmark}[throughput] and {url-distribution-benchmark}[distribution] benchmarks to be aware of encoder's behaviour and performance under certain options value.
== Credits
- http://ntz-develop.blogspot.com/2011/03/phonetic-algorithms.html[Blog post "Phonetic algorithms"] by Nikita Smetanin
- https://lucene.apache.org/[Apache Lucene] full-featured text search engine library
- https://www.elastic.co/[Elasticsearch] distributed search and analytics engine
== Contribute Use the {url-issue-tracker}[issue tracker] and/or open {url-pull-request}[pull requests].
== Licence Both link:{url-encoder-project}[encoder] and link:{url-esplugin-project}[esplugin] projects are released under version 2.0 of the http://www.apache.org/licenses/LICENSE-2.0[Apache Licence].