
Support Decompound Filter - Especially for German (the Donaudampfschiff case)

Open amenk opened this issue 2 years ago • 11 comments

In German it basically makes no difference whether the product name is

Jonglierkeulen, Jonglier Keulen or Jonglier-Keulen

If the user searches for "Keulen", all of them should match with a similar score.

Describe the solution you'd like

A quick Google search revealed that Elasticsearch is generally capable of such decomposition:

https://www.elastic.co/guide/en/elasticsearch/reference/7.17/analysis-dict-decomp-tokenfilter.html

Can this be included somehow?
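
To illustrate what such a decompound filter does, here is a minimal Python sketch of the idea, not Elasticsearch's actual implementation. It keeps the original token and appends every dictionary word found inside it. The function name `decompound` is hypothetical; the parameter names and default values are borrowed from the options of the `dictionary_decompounder` filter as an assumption about reasonable settings.

```python
def decompound(token, word_list, min_word_size=5,
               min_subword_size=2, max_subword_size=15):
    """Return the lowercased token followed by every dictionary word inside it.

    Simplified approximation of a dictionary-based decompounder:
    tokens shorter than min_word_size are returned unchanged.
    """
    token = token.lower()
    tokens = [token]
    if len(token) < min_word_size:
        return tokens
    # Try every substring whose length lies within the subword bounds.
    for start in range(len(token) - min_subword_size + 1):
        for size in range(min_subword_size, max_subword_size + 1):
            if start + size > len(token):
                break
            candidate = token[start:start + size]
            if candidate in word_list:
                tokens.append(candidate)
    return tokens


# Hypothetical two-entry word list for the example from this issue.
WORDS = {"jonglier", "keulen"}
print(decompound("Jonglierkeulen", WORDS))
```

With both parts in the word list, a search for "Keulen" can match the compound because "keulen" is emitted as an extra token alongside the original word.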

Describe alternatives you've considered

A thesaurus was considered, but we would need to create lots of entries for this.

Additional context

One question is also where the word list can come from. Can it be created automatically from the products that are already in the store? Or do we just use / upload a common word list?
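
One way the "created automatically from the products" idea could work is sketched below in Python. This is only an assumption about a possible approach, not something ElasticSuite provides: product names that are already written with a space or hyphen are split into parts, and those parts become the word list. The function name `build_word_list` and the `min_length` threshold are hypothetical.

```python
import re


def build_word_list(product_names, min_length=4):
    """Collect candidate dictionary words from existing product names.

    Names are lowercased and split on whitespace and hyphens; only purely
    alphabetic parts of at least min_length characters are kept.
    """
    words = set()
    for name in product_names:
        for part in re.split(r"[\s\-]+", name.lower()):
            if len(part) >= min_length and part.isalpha():
                words.add(part)
    return sorted(words)


names = ["Jonglier Keulen", "Jonglier-Keulen", "Jonglierkeulen"]
print(build_word_list(names))
```

This only finds subwords that appear separately in at least one product name, so a compound whose parts never occur on their own would still need a general word list.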

amenk avatar Apr 25 '22 14:04 amenk

I second this proposal, in Dutch we have mostly the same issue with compounded words.

I'd suggest putting it inside the 'frame' of the configuration for the Thesaurus.

For a customer of ours I've mostly mitigated this by using the nGram filter on just the product name. It generates fairly few false positives with an nGram setting of min = 3 and max = 15 (do not put the nGram filter on the description; that will match on pretty much everything). ( https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-ngram-tokenizer.html )

@amenk if you want an example of how I added the nGram filter to the name, let me know and I'll send it.

florisschreuder avatar May 02 '22 09:05 florisschreuder

@florisschreuder yes, that would be interesting

amenk avatar May 02 '22 09:05 amenk

Hey @amenk ,

I added these in a custom module:

elasticsuite_indices.xml

<?xml version="1.0"?>
<indices xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:noNamespaceSchemaLocation="urn:magento:module:Smile_ElasticsuiteCore:etc/elasticsuite_indices.xsd">
    <index identifier="catalog_product" defaultSearchType="product">
        <type name="product" idFieldName="entity_id">
            <mapping>
                <field name="name" type="text">
                    <isSearchable>1</isSearchable>
                    <isUsedForSortBy>1</isUsedForSortBy>
                    <isUsedInSpellcheck>1</isUsedInSpellcheck>
                    <isFilterable>0</isFilterable>
                    <defaultSearchAnalyzer>ngram</defaultSearchAnalyzer>
                </field>
            </mapping>
        </type>
    </index>

    <index identifier="catalog_category" defaultSearchType="category">
        <type name="category" idFieldName="entity_id">
            <mapping>
                <field name="name" type="text">
                    <isSearchable>1</isSearchable>
                    <isUsedForSortBy>1</isUsedForSortBy>
                    <isUsedInSpellcheck>1</isUsedInSpellcheck>
                    <isFilterable>0</isFilterable>
                    <defaultSearchAnalyzer>ngram</defaultSearchAnalyzer>
                </field>
            </mapping>
        </type>
    </index>
</indices>

elasticsuite_analysis.xml

<?xml version="1.0"?>
<analysis xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:noNamespaceSchemaLocation="urn:magento:module:Smile_ElasticsuiteCore:etc/elasticsuite_analysis.xsd">
    <filters>
        <filter name="ngram" type="ngram" language="default">
            <min_gram>4</min_gram>
            <max_gram>15</max_gram>
        </filter>
    </filters>

    <analyzers>
        <analyzer name="ngram" tokenizer="standard" language="default">
            <filters>
                <filter ref="ascii_folding" />
                <filter ref="trim" />
                <filter ref="word_delimiter" />
                <filter ref="lowercase" />
                <filter ref="elision" />
                <filter ref="ngram" />
            </filters>
            <char_filters>
                <char_filter ref="html_strip" />
            </char_filters>
        </analyzer>
    </analyzers>
</analysis>

Plugin/Smile/ElasticsuiteCore/Search/Request/Query/Fulltext/QueryBuilder.php

<?php
/**
 * Copyright © Experius All rights reserved.
 * See COPYING.txt for license details.
 */
declare(strict_types=1);

namespace Experius\ElasticSuiteSearchOptimizerExample\Plugin\Smile\ElasticsuiteCore\Search\Request\Query\Fulltext;

use Smile\ElasticsuiteCore\Api\Index\MappingInterface;
use Smile\ElasticsuiteCore\Api\Search\Request\ContainerConfigurationInterface;
use Smile\ElasticsuiteCore\Search\Request\QueryInterface;
use Smile\ElasticsuiteCore\Search\Request\Query\QueryFactory;

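class QueryBuilder
{
    /**
     * Declared explicitly so the constructor does not create a dynamic property.
     *
     * @var QueryFactory
     */
    private $queryFactory;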

    /**
     * @param QueryFactory $queryFactory
     */
    public function __construct(
        QueryFactory $queryFactory
    ) {
        $this->queryFactory = $queryFactory;
    }

    /**
     * Plugin to change the Fulltext (=search) Query to include our new field + analyser
     *
     * @SuppressWarnings(PHPMD.UnusedFormalParameter)
     *
     * @param \Smile\ElasticsuiteCore\Search\Request\Query\Fulltext\QueryBuilder $subject
     * @param $result
     * @param ContainerConfigurationInterface $containerConfig
     * @param $queryText
     * @param $spellingType
     * @param $boost
     * @return mixed
     */
    public function afterCreate(
        \Smile\ElasticsuiteCore\Search\Request\Query\Fulltext\QueryBuilder $subject,
        $result,
        ContainerConfigurationInterface $containerConfig,
        $queryText,
        $spellingType,
        $boost = 1
    ) {
        /**
         * This checks whether the result is a CutOffFrequency query.
         * A query like that filters instead of just matches, so we need to add our custom field+analyser to the fields
         * This code just rewrites the fields part of the filter, and leaves the rest intact
         */
        if(get_class($result) == 'Smile\ElasticsuiteCore\Search\Request\Query\Filtered') {
            $queryParams = [
                'query'  => $result->getQuery(),
                'filter' => $this->getCutoffFrequencyQuery($containerConfig, $queryText),
                'boost' => $result->getBoost()
            ];
            return $this->queryFactory->create(QueryInterface::TYPE_FILTER, $queryParams);
        }

        return $result;
    }

    /**
     * Mostly a copy of Smile\ElasticsuiteCore\Search\Request\Query\Fulltext\QueryBuilder
     * Only used to add your own fields into the $queryParams['fields']
     *
     * @param ContainerConfigurationInterface $containerConfig Search request container configuration.
     * @param string                          $queryText       The text query.
     *
     * @return QueryInterface
     */
    private function getCutoffFrequencyQuery(ContainerConfigurationInterface $containerConfig, $queryText)
    {
        $relevanceConfig = $containerConfig->getRelevanceConfig();

        /**
         * Add your own field into the 'fields' array.
         * Note that in this case we're using 'name.ngram'
         * This is the field (name) with the analyser (ngram)
         */
        $queryParams = [
            'fields'             => array_fill_keys([MappingInterface::DEFAULT_SEARCH_FIELD, 'sku', 'name.ngram'], 1),
            'queryText'          => $queryText,
            'cutoffFrequency'    => $relevanceConfig->getCutOffFrequency(),
            'minimumShouldMatch' => $relevanceConfig->getMinimumShouldMatch(),
        ];

        return $this->queryFactory->create(QueryInterface::TYPE_MULTIMATCH, $queryParams);
    }
}

florisschreuder avatar May 02 '22 10:05 florisschreuder

This solution seems interesting, though I have not tested it myself.

But beware: you are currently adding it to the "default" language, which means that if your store has other localized store views (such as French or English), it will be applied to them as well.

I suggest using only language="de_DE" or something like that

romainruaud avatar May 02 '22 13:05 romainruaud

@romainruaud note that I don't suggest this as a good solution for everyone. nGrams can easily produce false positives. For example, my customer got pillows in the results for the search query "armoire", because in Dutch the pillows' parent category was "accessoire".

So decompounding words is a better solution, although it requires more manual work from the merchant. Or at least a safer one, which can easily be configured for each implementation.
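
The false positive described above can be reproduced with a small Python sketch. It assumes, as in the posted configuration, that both the indexed text and the search text are split into n-grams of 4 to 15 characters; the helper name `ngrams` is hypothetical, and this is a simplification of the real analyzer chain.

```python
def ngrams(token, min_gram=4, max_gram=15):
    """Return the set of all character n-grams of the lowercased token."""
    token = token.lower()
    return {
        token[i:i + n]
        for n in range(min_gram, max_gram + 1)
        for i in range(len(token) - n + 1)
    }


# Wanted match: every gram of the query occurs in the compound word.
print(ngrams("keulen") <= ngrams("jonglierkeulen"))

# Unwanted match: the two unrelated words still share a gram.
print(sorted(ngrams("armoire") & ngrams("accessoire")))
```

The two unrelated words share the single gram "oire", which is enough for a match when any common gram counts. Raising `min_gram` reduces such overlaps but also stops short subwords from matching.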

florisschreuder avatar May 02 '22 13:05 florisschreuder

@florisschreuder By more manual work, do you mean maintaining the word list? What about loading a full dictionary of valid words into the Elasticsearch configuration, for example something like https://packages.debian.org/de/sid/wngerman

amenk avatar May 25 '22 14:05 amenk

On second thought, nGram might be an even better solution for our use case than a decompound filter. So in case we go with your (@florisschreuder) solution, shall we make a commonly usable open source module with the code you have posted? Or do you already have plans to publish that module? Another option would be to add nGram as an option to the ElasticSuite module. @romainruaud do you think this would fit into the scope?

amenk avatar May 25 '22 14:05 amenk

Hi @amenk ,

Loading in a dictionary seems theoretically the best option. However, there are some large concerns on the technical side, mostly that you don't want to send the entire dictionary with every query.

I have no plans to publish that code as a module. At the ecom agency where I work, we generally fine-tune the analyzers per implementation (and thus create a module per implementation). You're welcome to use the code however you want.

florisschreuder avatar May 26 '22 05:05 florisschreuder

Okay, thanks. I think/hope the dictionary would only need to be sent once, when configuring the analyzer. The filter also accepts a word_list_path:

      "word_list_path": "analysis/example_word_list.txt",

amenk avatar May 26 '22 09:05 amenk

@florisschreuder I have another question which you might be able to answer - to keep this issue focused on decompounding I asked here: https://github.com/Smile-SA/elasticsuite/discussions/2574

amenk avatar May 27 '22 07:05 amenk

Okay, thanks. I think/hope the dictionary would only need to be sent once, when configuring the analyzer. The filter also accepts a word_list_path:

      "word_list_path": "analysis/example_word_list.txt",

Right, my bad. There are two ways to decompound words, and one of them is to put the decompounding filter within the query itself. But that's not really efficient, so yes, your way would be better.
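
For reference, the index-time variant discussed here could look roughly like the following Elasticsearch index settings, written as a Python dictionary and printed as JSON. This is a sketch under assumptions: the filter and analyzer names `german_decompounder` and `german_decompound` are hypothetical, the path is the example path quoted above (resolved relative to the Elasticsearch config directory), and whether ElasticSuite can pass such a filter through its own analysis configuration is not confirmed in this thread.

```python
import json

# Index settings defining the decompounder once, at analyzer configuration
# time, so the word list is read from a file instead of sent with each query.
index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "german_decompounder": {  # hypothetical filter name
                    "type": "dictionary_decompounder",
                    "word_list_path": "analysis/example_word_list.txt",
                }
            },
            "analyzer": {
                "german_decompound": {  # hypothetical analyzer name
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "german_decompounder"],
                }
            },
        }
    }
}

print(json.dumps(index_settings, indent=2))
```

Because the word list lives in a file on each Elasticsearch node, updating it means changing that file and reopening or recreating the index, rather than changing the queries.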

florisschreuder avatar May 27 '22 08:05 florisschreuder