HyphenationCompoundWordTokenFilter fixed token position and preserves original token

Open jetzerv opened this issue 6 months ago • 1 comments

Description

The HyphenationCompoundWordTokenFilter is the decompounder that the Elasticsearch docs recommend for Germanic languages. However, the decompounding doesn't work as expected for my use case. Let me explain with an example:

  1. The user searches for 'Sommerkleid' in a webshop (German for Summer dress)
  2. Decompounding the word 'Sommerkleid' returns 'sommerkleid', 'sommer' and 'kleid'. (All 3 tokens are at position 0.)
  3. Since all tokens are at position 0, the customer gets products that contain 'sommer' OR 'kleid' OR 'sommerkleid', although the customer was searching for both terms, not either one. This leads to random products that are not a 'kleid' but are categorized as 'sommer' products.

Ideally there would be two extra properties to:

  1. exclude initial token from output (default false for backwards compatibility)
  2. increase position for split tokens ('sommer' would be pos: 0, 'kleid' would be pos: 1)
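
To make the two proposed properties concrete, here is a minimal sketch in plain Java. It does not use any Lucene classes and is not how the filter is implemented: the `decompound` helper, its `keepOriginal` and `incrementSubwords` flags, and the `Token` type are all hypothetical names, and the sub-words are passed in by hand instead of being derived from hyphenation patterns. It only shows how position increments turn into absolute positions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CompoundPositions {

    /** A term plus its position increment relative to the previous token. */
    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    /**
     * Hypothetical decompounder. The sub-words are supplied by the caller;
     * a real filter would compute them from a hyphenation tree and dictionary.
     */
    static List<Token> decompound(String original, List<String> subwords,
                                  boolean keepOriginal, boolean incrementSubwords) {
        List<Token> out = new ArrayList<>();
        if (keepOriginal) {
            out.add(new Token(original, 1));
        }
        for (int i = 0; i < subwords.size(); i++) {
            int increment;
            if (out.isEmpty()) {
                increment = 1;            // first emitted token advances the position
            } else if (i == 0 || !incrementSubwords) {
                increment = 0;            // stack on the previous token's position
            } else {
                increment = 1;            // proposed: each later sub-word moves forward
            }
            out.add(new Token(subwords.get(i), increment));
        }
        return out;
    }

    /** Turns increments into absolute positions (positions start at 0). */
    static Map<String, Integer> absolutePositions(List<Token> tokens) {
        Map<String, Integer> positions = new LinkedHashMap<>();
        int position = -1;
        for (Token t : tokens) {
            position += t.positionIncrement;
            positions.put(t.term, position);
        }
        return positions;
    }

    public static void main(String[] args) {
        List<String> parts = Arrays.asList("sommer", "kleid");
        // Behaviour described in this issue: everything stacked at position 0.
        System.out.println(absolutePositions(decompound("sommerkleid", parts, true, false)));
        // prints {sommerkleid=0, sommer=0, kleid=0}
        // Both proposed properties enabled: original dropped, sub-words in sequence.
        System.out.println(absolutePositions(decompound("sommerkleid", parts, false, true)));
        // prints {sommer=0, kleid=1}
    }
}
```

With the sub-words at consecutive positions, a query parser can treat them as a sequence (for example a phrase or an AND of both terms) instead of as alternatives at one position.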

Would this be possible to add? I already saw a related issue from 5 years ago -> https://github.com/apache/lucene/issues/10625, although it was not implemented back then.

jetzerv, May 07 '25 13:05

To address your 2nd idea (increment the position for each sub-word in the compound word), I think we'd need to create a graph-aware CompoundWordTokenFilter. It would also emit PositionLengthAttribute, and would correctly express that your original token spanned two positions, and sommer was at position 0, kleid at position 1, and Sommerkleid at position 0 but spanning two positions.
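
The graph described above can be sketched in plain Java, without Lucene on the classpath. In Lucene the two values would be carried by `PositionIncrementAttribute` and `PositionLengthAttribute`; the `GraphToken`, `Arc`, `toArcs` and `paths` names below are hypothetical and exist only for this illustration. Each token becomes an arc from its start position to start plus position length, and walking the arcs shows that 'sommer kleid' and 'sommerkleid' are two parallel paths over the same span.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TokenGraphSketch {

    /** A token with a position increment and a position length (assumed >= 1). */
    static final class GraphToken {
        final String term;
        final int positionIncrement;
        final int positionLength;

        GraphToken(String term, int positionIncrement, int positionLength) {
            this.term = term;
            this.positionIncrement = positionIncrement;
            this.positionLength = positionLength;
        }
    }

    /** The same token viewed as an arc between two position nodes. */
    static final class Arc {
        final String term;
        final int from;
        final int to;

        Arc(String term, int from, int to) {
            this.term = term;
            this.from = from;
            this.to = to;
        }
    }

    /** Converts increments and lengths into arcs with absolute positions. */
    static List<Arc> toArcs(List<GraphToken> tokens) {
        List<Arc> arcs = new ArrayList<>();
        int position = -1;
        for (GraphToken t : tokens) {
            position += t.positionIncrement;
            arcs.add(new Arc(t.term, position, position + t.positionLength));
        }
        return arcs;
    }

    /** Lists every path from position 0 to the last position in the graph. */
    static List<String> paths(List<Arc> arcs) {
        int end = 0;
        for (Arc a : arcs) {
            end = Math.max(end, a.to);
        }
        List<String> out = new ArrayList<>();
        walk(arcs, 0, end, "", out);
        return out;
    }

    private static void walk(List<Arc> arcs, int node, int end, String prefix, List<String> out) {
        if (node == end) {
            out.add(prefix.trim());
            return;
        }
        for (Arc a : arcs) {
            if (a.from == node) {
                walk(arcs, a.to, end, prefix + " " + a.term, out);
            }
        }
    }

    public static void main(String[] args) {
        List<GraphToken> tokens = Arrays.asList(
            new GraphToken("sommerkleid", 1, 2), // position 0, spans two positions
            new GraphToken("sommer", 0, 1),      // position 0
            new GraphToken("kleid", 1, 1));      // position 1
        System.out.println(paths(toArcs(tokens)));
        // prints [sommerkleid, sommer kleid]
    }
}
```

A graph-aware consumer can then match either path as a whole, rather than treating all three terms as interchangeable alternatives at one position.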

We do have a graph-aware synonym filter (SynonymGraphFilter) ... I wonder if we could enhance that to accept a HyphenationTree? Or maybe we could rewrite the German compounding rules as synonyms and use SynonymGraphFilter directly?

My long-ago blog post talks about understanding Lucene's TokenStreams as graphs, but not all TokenStreams create a graph.

mikemccand, May 20 '25 13:05