elasticsearch-analysis-decompound

Failure to decompose "Taschenhersteller"

Open blured75 opened this issue 8 years ago • 5 comments

Hi,

First of all, thanks for your plugin, which could let me avoid the obscure compound word token filter with hyphenation_decompounder (https://www.elastic.co/guide/en/elasticsearch/reference/2.0/analysis-compound-word-tokenfilter.html).

Having said that, I cannot decompose "Taschenhersteller", a German word that should be split into two words: Taschen and Hersteller.

After installing your plugin, I created the following (possibly erroneous) mapping:

-XPOST localhost:9200/my_index {
  "index": {
    "analysis": {
      "filter": {
        "decomp": {
          "type": "decompound"
        }
      },
      "tokenizer": {
        "decomp": {
          "type": "standard",
          "filter": [
            "decomp"
          ]
        }
      },
      "analyzer": {
        "my_anal": {
          "type": "custom",
          "tokenizer": "decomp"
        }
      }
    },
    "mappings": {
      "type1": {
        "properties": {
          "field1": {
            "type": "string",
            "analyzer": "my_anal"
          }
        }
      }
    }
  }
}

When trying to analyze the text "Taschenhersteller"

-XPOST localhost:9200/my_index {
    "analyzer": "my_anal",
    "text": "Taschenhersteller"
}

It gives me

{
    "tokens": [
        {
            "token": "Taschenhersteller",
            "start_offset": 0,
            "end_offset": 17,
            "type": "<ALPHANUM>",
            "position": 0
        }
    ]
}

I don't understand what I'm doing wrong.

Could you help me, please? :)

blured75 avatar Mar 09 '16 16:03 blured75

Does it work for other terms like Ölpumpe or Motoröl?

Also, you need to enable the default token filters, such as lowercase, in your custom analyzer before the decomp token filter.

Those were my test cases back in the day. I also noticed that some terms are not split by the plugin at all.

Maybe we could improve the plugin together, @jprante?

My analyser looks like this:

"svb_decompoundAnalyzer":{  
  "filter":[  
    "lowercase",
    "svb_decompound",
    "unique"
     ],
  "tokenizer":"standard"
}

And filter:

 "svb_decompound":{  
   "type":"decompound"
 },
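For reference, here is a rough sketch of how the two fragments above could be combined into one create-index request body. It is written in Python only to make the structure easy to check. The helper name `build_analysis_settings` is hypothetical, and wrapping everything in a top-level `settings` key with an explicit `"type": "custom"` is an assumption based on the usual index-settings layout, not something taken from the snippets themselves.

```python
import json


def build_analysis_settings(analyzer_name, filter_name):
    """Return a create-index body with one decompound filter and one analyzer.

    The analyzer runs the standard tokenizer, then lowercases the tokens,
    then applies the decompound filter, then drops duplicate tokens.
    """
    return {
        "settings": {  # assumed top-level key for index settings
            "analysis": {
                "filter": {
                    # User-defined filter backed by the plugin's "decompound" type
                    filter_name: {"type": "decompound"},
                },
                "analyzer": {
                    analyzer_name: {
                        "type": "custom",
                        "tokenizer": "standard",
                        # Order matters: lowercase runs before the decompound filter
                        "filter": ["lowercase", filter_name, "unique"],
                    },
                },
            }
        }
    }


body = build_analysis_settings("svb_decompoundAnalyzer", "svb_decompound")
print(json.dumps(body, indent=2))
```

The printed JSON can be used as the body of the index-creation request. The key point is that the `filter` list belongs to the analyzer, not to the tokenizer.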

mablae avatar Mar 09 '16 21:03 mablae

The current implementation can be extended with custom compound words. For example code, see https://github.com/jprante/elasticsearch-analysis-decompound/blob/master/src/test/java/org/xbib/decompound/TrainerTests.java

A possible input for German is the Morphy lexicon morphy-mapping-20110717.latin1.gz

jprante avatar Mar 09 '16 22:03 jprante

Great!! It works really well :)

It was my mapping that was erroneous. Here is the corrected version, which works:

{
  "index": {
    "analysis": {
      "filter": {
        "decomp": {
          "type": "decompound"
        }
      },
      "analyzer": {
        "my_anal": {
          "filter": [
            "lowercase",
            "decomp",
            "unique"
          ],
          "tokenizer": "standard"
        }
      }
    },
    "mappings": {
      "type1": {
        "properties": {
          "field1": {
            "type": "string",
            "analyzer": "my_anal"
          }
        }
      }
    }
  }
}
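An analyzer refers to its token filters by name, so every name in its `filter` list has to match either a built-in filter or an entry defined under `analysis.filter`. Below is a minimal sketch of a check for that, assuming a settings dict shaped like the ones in this thread. The function name `undefined_filters` is hypothetical, and the `builtin` tuple lists only the two built-in filters used here, not the full set.

```python
def undefined_filters(analysis, builtin=("lowercase", "unique")):
    """Map each analyzer name to the filter names it uses but nothing defines.

    `analysis` is the dict found under the "analysis" key of the index settings.
    `builtin` is an assumed, deliberately incomplete list of built-in filters.
    """
    known = set(analysis.get("filter", {})) | set(builtin)
    problems = {}
    for name, spec in analysis.get("analyzer", {}).items():
        missing = [f for f in spec.get("filter", []) if f not in known]
        if missing:
            problems[name] = missing
    return problems


# Hypothetical example: the filter is defined as "decomp",
# but the analyzer asks for "svb_decompound".
mismatched = {
    "filter": {"decomp": {"type": "decompound"}},
    "analyzer": {
        "my_anal": {
            "tokenizer": "standard",
            "filter": ["lowercase", "svb_decompound", "unique"],
        }
    },
}
print(undefined_filters(mismatched))
```

An empty result means every referenced filter is either defined or in the assumed built-in list. It does not prove that the plugin is installed or that the filter type exists.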

blured75 avatar Mar 10 '16 09:03 blured75

Would it be possible for you to backport the plugin to Elasticsearch 2.0? That would be wunderbar :)

Best regards, Blured.

blured75 avatar Mar 10 '16 09:03 blured75

@jprante Thanks for the pointer, I will look into it.

mablae avatar Mar 12 '16 06:03 mablae