elasticsearch-analysis-decompound
Failure to decompose "Taschenhersteller"
Hi,
First of all, thanks for your plugin, which avoids having to use the obscure compound word token filter with hyphenation_decompounder (https://www.elastic.co/guide/en/elasticsearch/reference/2.0/analysis-compound-word-tokenfilter.html).
That said, I cannot decompose "Taschenhersteller", a German word that should be decomposed into two words: Taschen and Hersteller. Having installed your plugin, I made the following (possibly erroneous) mapping:
-XPOST localhost:9200/my_index
{
  "index": {
    "analysis": {
      "filter": {
        "decomp": {
          "type": "decompound"
        }
      },
      "tokenizer": {
        "decomp": {
          "type": "standard",
          "filter": [
            "decomp"
          ]
        }
      },
      "analyzer": {
        "my_anal": {
          "type": "custom",
          "tokenizer": "decomp"
        }
      }
    },
    "mappings": {
      "type1": {
        "properties": {
          "field1": {
            "type": "string",
            "analyzer": "my_anal"
          }
        }
      }
    }
  }
}
When trying to analyze the text "Taschenhersteller":
-XPOST localhost:9200/my_index/_analyze
{
  "analyzer": "my_anal",
  "text": "Taschenhersteller"
}
It gives me:
{
  "tokens": [
    {
      "token": "Taschenhersteller",
      "start_offset": 0,
      "end_offset": 17,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}
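For context, the splitting being asked for here can be illustrated with a naive two-part dictionary lookup. This is only a sketch of the general idea, not the plugin's actual algorithm (which uses a trained model), and the `vocab` set is a made-up stand-in for a real German lexicon.

```python
# Illustration only: a naive dictionary-based compound splitter.
# This is NOT the plugin's algorithm (which uses a trained model);
# `vocab` is a made-up stand-in for a real German lexicon.
def split_compound(word, vocab):
    """Return [left, right] if the lowercased word splits into two
    known parts; otherwise return the word unchanged."""
    w = word.lower()
    for i in range(1, len(w)):
        left, right = w[:i], w[i:]
        if left in vocab and right in vocab:
            return [left, right]
    return [word]

vocab = {"taschen", "hersteller"}
print(split_compound("Taschenhersteller", vocab))  # ['taschen', 'hersteller']
print(split_compound("Unsplittable", vocab))       # ['Unsplittable']
```

Note that the splitter lowercases its input before the dictionary lookup, which is exactly why a lowercase filter has to run before the decompound filter in the analyzer chain discussed below.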
I don't understand what I'm doing wrong. Could you help me, please? :)
Does it work for other terms like Ölpumpe? Motoröl?
Also, you need to activate the default token filters like lowercase in your custom analyzer before the decomp token filter.
Those were my test cases back in the day. I also noticed that some terms are not split by the plugin at all.
Maybe we could improve the plugin together, @jprante?
My analyser looks like this:
"svb_decompoundAnalyzer": {
  "filter": [
    "lowercase",
    "svb_decompound",
    "unique"
  ],
  "tokenizer": "standard"
}
And the filter:
"svb_decompound": {
  "type": "decompound"
}
The current implementation can be extended with custom compound words; for example code, see https://github.com/jprante/elasticsearch-analysis-decompound/blob/master/src/test/java/org/xbib/decompound/TrainerTests.java
A possible input for German is the Morphy lexicon morphy-mapping-20110717.latin1.gz.
Great!! It works really well :)
My mapping was erroneous. Here is the corrected version, which works:
{
  "index": {
    "analysis": {
      "filter": {
        "decomp": {
          "type": "decompound"
        }
      },
      "analyzer": {
        "my_anal": {
          "filter": [
            "lowercase",
            "decomp",
            "unique"
          ],
          "tokenizer": "standard"
        }
      }
    },
    "mappings": {
      "type1": {
        "properties": {
          "field1": {
            "type": "string",
            "analyzer": "my_anal"
          }
        }
      }
    }
  }
}
Is it possible for you to make a backport to the Elasticsearch 2.0 version? It would be wunderbar :)
Best regards, Blured.
@jprante Thanks for that pointer, I will read into it.