Decompounding Plugin for Elasticsearch
This is an implementation of a word decompounder plugin for link:http://github.com/elasticsearch/elasticsearch[Elasticsearch].
Compounding several words into one is a property that not all languages share. It is common in German, the Scandinavian languages, Finnish, and Korean.
This code is a reworked implementation of the link:http://wortschatz.uni-leipzig.de/~cbiemann/software/toolbox/Baseforms%20Tool.htm[Baseforms Tool] found in the http://wortschatz.uni-leipzig.de/~cbiemann/software/toolbox/index.htm[ASV toolbox] of http://asv.informatik.uni-leipzig.de/staff/Chris_Biemann[Chris Biemann], Automatische Sprachverarbeitung of Leipzig University.
Lucene comes with two compound word token filters, a dictionary-based and a hyphenation-based variant. Both share a disadvantage: they require loading a word list into memory before they run. This decompounder does not require word lists and can process German-language text out of the box. For efficient word segmentation, it uses prebuilt Compact Patricia Tries provided by the ASV toolbox.
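To give a rough idea of how a trie can drive word segmentation, here is a toy sketch in Python. It is not the plugin's algorithm: it uses a plain, uncompressed trie built from a hand-picked vocabulary (`build_trie`, `prefixes`, and `split_compound` are hypothetical names), whereas the plugin relies on the prebuilt Compact Patricia Tries of the ASV toolbox and needs no word list. It also ignores linking elements such as the "es" in "Jahresfeier".

```python
# Toy illustration only: a plain (uncompressed) trie and a recursive splitter.
# The prebuilt Compact Patricia Tries used by the plugin work differently.

END = None  # marker key for "a word ends here"; never collides with a character


def build_trie(words):
    """Build a nested-dict trie from an iterable of words."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def prefixes(trie, text, start):
    """Yield every end index e such that text[start:e] is a word in the trie."""
    node = trie
    for i in range(start, len(text)):
        node = node.get(text[i])
        if node is None:
            return
        if END in node:
            yield i + 1


def split_compound(trie, text, start=0):
    """Return one segmentation of text[start:] into trie words, or None."""
    if start == len(text):
        return []
    # Try longer prefixes first, then fall back to shorter ones.
    for end in sorted(prefixes(trie, text, start), reverse=True):
        rest = split_compound(trie, text, end)
        if rest is not None:
            return [text[start:end]] + rest
    return None


# Hypothetical mini vocabulary, chosen by hand for this example.
trie = build_trie(["donau", "dampf", "schiff"])
print(split_compound(trie, "donaudampfschiff"))  # ['donau', 'dampf', 'schiff']
```

The sketch returns `None` when no full segmentation exists, which is the toy counterpart of a word that cannot be decomposed.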
.Compatibility matrix
[frame="all"]
|===
| Plugin version | Elasticsearch version | Release date

| | 5.4.3 | Aug 24 2017
| | 5.4.0 | May 12 2017
| | 5.1.1 | Dec 19 2016
| | 2.4.1 | Nov 16 2016
| | 2.3.4 | Jul 30 2016
| | 2.3.3 | Jun 1 2016
| | 2.3.2 | Jun 1 2016
| | 2.3.1 | Jun 1 2016
| | 2.3.0 | Mar 31 2016
| | 2.2.1 | Mar 31 2016
| | 2.2.0 | Feb 19 2016
| | 2.1.1 | Dec 22 2015
| | 2.1.0 | Dec 8 2015
| | 1.7.1 | Nov 17 2015
| | 1.5.2 | Oct 26 2015
|===
Elasticsearch 5.x
./bin/elasticsearch-plugin install http://xbib.org/repository/org/xbib/elasticsearch/plugin/elasticsearch-analysis-decompound/
Do not forget to restart the node after installing.
Elasticsearch 2.x
./bin/plugin install http://xbib.org/repository/org/xbib/elasticsearch/plugin/elasticsearch-analysis-decompound/
All feedback is welcome! If you find issues, please post them at https://github.com/jprante/elasticsearch-analysis-decompound/issues[GitHub].
PUT /test
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "decomp": {
            "type": "decompound"
          }
        },
        "analyzer": {
          "decomp": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": [ "decomp", "unique", "german_normalization", "lowercase" ]
          }
        }
      }
    }
  },
  "mappings": {
    "docs": {
      "properties": {
        "text": {
          "type": "text",
          "analyzer": "decomp"
        }
      }
    }
  }
}

GET /test/docs/_mapping

PUT /test/docs/1
{
  "text": "Die Jahresfeier der Rechtsanwaltskanzleien auf dem Donaudampfschiff hat viel Ökosteuer gekostet"
}

POST /test/docs/_search?explain
{
  "query": {
    "match": {
      "text": "dampf schiff"
    }
  }
}
"Die Jahresfeier der Rechtsanwaltskanzleien auf dem Donaudampfschiff hat viel Ökosteuer gekostet" will be tokenized into "Die", "Die", "Jahresfeier", "Jahr", "feier", "der", "der", "Rechtsanwaltskanzleien", "Recht", "anwalt", "kanzlei", "auf", "auf", "dem", "dem", "Donaudampfschiff", "Donau", "dampf", "schiff", "hat", "hat", "viel", "viel", "Ökosteuer", "Ökosteuer", "gekostet", "gekosten"
It is recommended to add the https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-unique-tokenfilter.html[Unique token filter] to skip tokens that occur more than once.
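The effect of removing repeated tokens can be pictured with a few lines of Python. This is only a sketch of the outcome on a plain list: the helper `unique_tokens` is hypothetical, and the real filter works on a Lucene token stream inside Elasticsearch.

```python
def unique_tokens(tokens):
    """Drop repeated tokens, keeping the first occurrence and the original order."""
    seen = set()
    result = []
    for token in tokens:
        if token not in seen:
            seen.add(token)
            result.append(token)
    return result


# First tokens of the example sentence, including the duplicates.
tokens = ["Die", "Die", "Jahresfeier", "Jahr", "feier", "der", "der"]
print(unique_tokens(tokens))  # ['Die', 'Jahresfeier', 'Jahr', 'feier', 'der']
```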
The input "Ein schöner Tag in Köln im Café an der Straßenecke" will be tokenized into "Ein", "schoner", "Tag", "in", "Koln", "im", "Café", "an", "der", "Strassenecke".
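To check how an analyzer tokenizes a given input, Elasticsearch offers the `_analyze` API. The Python sketch below only builds such a request with the standard library and does not send it. The node address `http://localhost:9200` is an assumption, the index and analyzer names `test` and `decomp` follow the example above, and `build_analyze_request` is a hypothetical helper.

```python
import json
import urllib.request


def build_analyze_request(base_url, index, analyzer, text):
    """Build (but do not send) a POST request for the index-level _analyze API."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_analyze_request(
    "http://localhost:9200",  # assumed address of a local node
    "test",
    "decomp",
    "Donaudampfschiff",
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# To send it against a running node (not executed here):
# with urllib.request.urlopen(request) as response:
#     tokens = [t["token"] for t in json.load(response)["tokens"]]
```

Keeping the request construction separate from the network call makes it easy to inspect the body before anything reaches the cluster.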
The decompounding algorithm uses a threshold to decide whether a word has been decomposed successfully. If the threshold is too low, words can silently disappear from the index. In that case, adjust the threshold until words no longer disappear.
The default threshold value is 0.51. You can modify it in the settings:

"index" : {
  "analysis" : {
    "filter" : {
      "decomp" : {
        "type" : "decompound",
        "threshold" : 0.51
      }
    }
  }
}
Sometimes only the decomposed subwords should be indexed. For this, set the parameter "subwords_only" to true:

"index" : {
  "analysis" : {
    "filter" : {
      "decomp" : {
        "type" : "decompound",
        "subwords_only" : true
      }
    }
  }
}
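When index settings are generated by a script, the two parameters can be set from one place. The following Python sketch builds the settings as a dictionary and prints the JSON body. `decompound_settings` is a hypothetical helper. It assumes that both parameters may be given on the same filter and that omitting "subwords_only" behaves like false. The analyzer definition is copied from the index example earlier in this document.

```python
import json


def decompound_settings(threshold=0.51, subwords_only=False):
    """Return index settings with a decompound filter named "decomp".

    Assumption: "threshold" and "subwords_only" can be combined on one filter.
    """
    return {
        "index": {
            "analysis": {
                "filter": {
                    "decomp": {
                        "type": "decompound",
                        "threshold": threshold,
                        "subwords_only": subwords_only,
                    }
                },
                "analyzer": {
                    "decomp": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": [
                            "decomp",
                            "unique",
                            "german_normalization",
                            "lowercase",
                        ],
                    }
                },
            }
        }
    }


# Body for creating an index, for example with PUT /test.
body = json.dumps({"settings": decompound_settings(subwords_only=True)}, indent=2)
print(body)
```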
The Compact Patricia Trie data structure can be found in
The compound splitter used for generating features for document classification is described in
The base form reduction step (for Norwegian) is described in
