= Elasticsearch plugin for Russian Phonetic Analysis Nikolay Papakha ifdef::env-github[] :tip-caption: :bulb: :note-caption: :paperclip: :important-caption: :heavy_exclamation_mark: :caution-caption: :fire: :warning-caption: :warning: endif::[] ifndef::env-github[] endif::[]

image:https://travis-ci.org/papahigh/elasticsearch-russian-phonetics.svg?branch=master["Build Status", link="https://travis-ci.org/papahigh/elasticsearch-russian-phonetics"] image:https://codecov.io/gh/papahigh/elasticsearch-russian-phonetics/branch/master/graph/badge.svg["Code Coverage for encoder project", link="https://codecov.io/gh/papahigh/elasticsearch-russian-phonetics"] image:https://img.shields.io/badge/License-Apache%202.0-blue.svg[link=https://opensource.org/licenses/Apache-2.0]

:url-throughput-benchmark: https://github.com/papahigh/elasticsearch-russian-phonetics/blob/master/benchmark/throughput.asciidoc :url-distribution-benchmark: https://github.com/papahigh/elasticsearch-russian-phonetics/blob/master/benchmark/distribution.asciidoc :url-misspellings-benchmark: https://github.com/papahigh/elasticsearch-russian-phonetics/blob/master/benchmark/misspellings_and_typos.asciidoc :url-unit-tests: https://github.com/papahigh/elasticsearch-russian-phonetics/tree/master/encoder/src/test/java/com/github/papahigh/phonetic/encoder :url-encoding-rules: https://github.com/papahigh/elasticsearch-russian-phonetics/blob/master/encoder/README.asciidoc :url-releases-page: https://github.com/papahigh/elasticsearch-russian-phonetics/blob/master/releases.asciidoc :url-issue-tracker: https://github.com/papahigh/elasticsearch-russian-phonetics/issues :url-pull-request: https://github.com/papahigh/elasticsearch-russian-phonetics/pulls :url-encoder-project: https://github.com/papahigh/elasticsearch-russian-phonetics/tree/master/encoder :url-esplugin-project: https://github.com/papahigh/elasticsearch-russian-phonetics/tree/master/esplugin

This plugin provides phonetic analysis of Russian language by exposing *russian_phonetic* token filter which transforms russian words to their phonetic representation or so-called phonetic code. These codes are used for matching words and names which sound similar. The process of transformation is also known as phonetic encoding and this plugin is able to encode millions of russian words per second with the lowest impact on GC among all encoders compared in encoding throughput benchmarks.

NOTE: Results for {url-misspellings-benchmark}[matching misspellings and typos], {url-distribution-benchmark}[distribution] and {url-throughput-benchmark}[encoding throughput] benchmarks.

Encoding algorithm extensively employs phonetic and orthographic rules in order to fill the inconsistency gap between spelling and pronunciation in Russian Language.

[source,intent=0] .Examples of spelling and pronunciation inconsistency

вдры[зг] ⟷ вдры[ск] слове[тск]ий ⟷ славе[цк]ий ла[ндш]афт ⟷ ла[нш]афт п[я]так ⟷ п[и]так бу[хг]алтер ⟷ бу[г]алтер бю[стг]алтер ⟷ бю[зд]галтер ле[стн]ица ⟷ ле[сн]ица кислово[дск] ⟷ кислово[цк]

You can find more information about encoding process at the {url-encoding-rules}[encoding rules] and {url-unit-tests}[unit tests].

== Installation

In order to install the plugin, {url-releases-page}[choose a version] and run:

[source,sh]

$ bin/elasticsearch-plugin install URL

where *URL* points to zip file of the appropriate release which corresponds to your elasticsearch version.

IMPORTANT: The plugin must be installed on every node in the cluster, and each node must be restarted after installation.

E.g., command for Elasticsearch 7.6.2

[source,sh,options="wrap"]

install plugin on Elasticsearch 7.6.2

$ bin/elasticsearch-plugin install https://github.com/papahigh/elasticsearch-russian-phonetics/raw/7.6.2/esplugin/plugin-distributions/analysis-russian-phonetic-7.6.2.zip

After installation plugin exposes new token filter named *russian_phonetic*.

== Getting started

You can start using the `russian_phonetic` token filter by providing analysis configuration: [source,javascript]

PUT /russian_phonetic_sample { "settings": { "analysis": { "analyzer": { "my_analyzer": { "tokenizer": "standard", "filter": [ "standard", "russian_phonetic" ] } }, "filter": { "russian_phonetic": { "type": "russian_phonetic", "replace": false } } } } }

Then you should be able to hit the analyzer with `russian_phonetic` token filter using the https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html[analyze API] [source,javascript]

POST /russian_phonetic_sample/_analyze { "analyzer": "my_analyzer", "text": "студентка комсомолка спортсменка" }

Returns: *стднк*, студентка, *кмсмлк*, комсомолка, *спрцмнк*, спортсменка

[[token-filter-settings]] == Token filter settings

The *russian_phonetic* token filter provides a bunch of configuration options to meet your particular needs:

replace:: Whether or not the original token should be replaced by the phonetic code. Accepts `true` (default) or `false`. vowels:: Defines encoding mode for vowels. Accepts `encode_first` (default) or `encode_all`. + [source,intent=0] .encode_first: only first vowel in the supplied word will be encoded

упячка → упчк голландский → глнскй абсурд → апсрт

[source,intent=0] .encode_all: all vowels will be encoded according to the link:{url-encoding-rules}[encoding rules]

упячка → уп2чк1 голландский → г1л1нск2й абсурд → апс3рт

max_code_len:: The maximum length of the phonetic code. Defaults to `8`. enable_stemmer:: Whether or not the link:http://snowball.tartarus.org/algorithms/russian/stemmer.html[stemming] should be applied. Accepts `true` or `false` (default). When this option is enabled only base (or root) form of the supplied word will be encoded. + [source,intent=0]

аннотируешь → антрш аннотируешься → антрш аннотируешь → ан1т2р32ш аннотируешься → ан1т2р32ш ящурным → ящрн ящурные → ящрн ящурным → ящ3рн ящурные → ящ3рн

TIP: Please take a look at the {url-throughput-benchmark}[throughput] and {url-distribution-benchmark}[distribution] benchmarks to be aware of encoder's behaviour and performance under certain options value.

== Credits

http://ntz-develop.blogspot.com/2011/03/phonetic-algorithms.html[Blog post "Phonetic algorithms"] by Nikita Smetanin
https://lucene.apache.org/[Apache Lucene] full-featured text search engine library
https://www.elastic.co/[Elasticsearch] distributed search and analytics engine

== Contribute Use the {url-issue-tracker}[issue tracker] and/or open {url-pull-request}[pull requests].

== Licence Both link:{url-encoder-project}[encoder] and link:{url-esplugin-project}[esplugin] projects are released under version 2.0 of the http://www.apache.org/licenses/LICENSE-2.0[Apache Licence].

elasticsearch-russian-phonetics
elasticsearch-russian-phonetics copied to clipboard

Metadata

[source,intent=0] .Examples of spelling and pronunciation inconsistency

[source,sh]

$ bin/elasticsearch-plugin install URL

[source,sh,options="wrap"]

install plugin on Elasticsearch 7.6.2

$ bin/elasticsearch-plugin install https://github.com/papahigh/elasticsearch-russian-phonetics/raw/7.6.2/esplugin/plugin-distributions/analysis-russian-phonetic-7.6.2.zip

You can start using the `russian_phonetic` token filter by providing analysis configuration: [source,javascript]

PUT /russian_phonetic_sample { "settings": { "analysis": { "analyzer": { "my_analyzer": { "tokenizer": "standard", "filter": [ "standard", "russian_phonetic" ] } }, "filter": { "russian_phonetic": { "type": "russian_phonetic", "replace": false } } } } }

Then you should be able to hit the analyzer with `russian_phonetic` token filter using the https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html[analyze API] [source,javascript]

POST /russian_phonetic_sample/_analyze { "analyzer": "my_analyzer", "text": "студентка комсомолка спортсменка" }

упячка → упчк голландский → глнскй абсурд → апсрт

[source,intent=0] .encode_all: all vowels will be encoded according to the link:{url-encoding-rules}[encoding rules]

упячка → уп2чк1 голландский → г1л1нск2й абсурд → апс3рт

аннотируешь → антрш аннотируешься → антрш аннотируешь → ан1т2р32ш аннотируешься → ан1т2р32ш ящурным → ящрн ящурные → ящрн ящурным → ящ3рн ящурные → ящ3рн

← Metadata

Owner

Metadata

elasticsearch-russian-phonetics elasticsearch-russian-phonetics copied to clipboard

Metadata

[source,intent=0] .Examples of spelling and pronunciation inconsistency

[source,sh]

$ bin/elasticsearch-plugin install URL

[source,sh,options="wrap"]

install plugin on Elasticsearch 7.6.2

$ bin/elasticsearch-plugin install https://github.com/papahigh/elasticsearch-russian-phonetics/raw/7.6.2/esplugin/plugin-distributions/analysis-russian-phonetic-7.6.2.zip

You can start using the *russian_phonetic* token filter by providing analysis configuration: [source,javascript]

PUT /russian_phonetic_sample { "settings": { "analysis": { "analyzer": { "my_analyzer": { "tokenizer": "standard", "filter": [ "standard", "russian_phonetic" ] } }, "filter": { "russian_phonetic": { "type": "russian_phonetic", "replace": false } } } } }

Then you should be able to hit the analyzer with *russian_phonetic* token filter using the https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html[analyze API] [source,javascript]

POST /russian_phonetic_sample/_analyze { "analyzer": "my_analyzer", "text": "студентка комсомолка спортсменка" }

упячка → упчк голландский → глнскй абсурд → апсрт

[source,intent=0] .encode_all: all vowels will be encoded according to the link:{url-encoding-rules}[encoding rules]

упячка → уп2чк1 голландский → г1л1нск2й абсурд → апс3рт

аннотируешь → антрш аннотируешься → антрш аннотируешь → ан1т2р32ш аннотируешься → ан1т2р32ш ящурным → ящрн ящурные → ящрн ящурным → ящ3рн ящурные → ящ3рн

← Metadata

Owner

Metadata

elasticsearch-russian-phonetics
elasticsearch-russian-phonetics copied to clipboard

You can start using the `russian_phonetic` token filter by providing analysis configuration: [source,javascript]

Then you should be able to hit the analyzer with `russian_phonetic` token filter using the https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html[analyze API] [source,javascript]