apply unidirectional synonyms at query-time

Open missinglink opened this issue 5 years ago • 4 comments

as of today we finally removed all unidirectional synonyms (ones using the a=>b syntax) from our default synonyms file 🎉

unfortunately, I realized that there is a bug which is preventing those unidirectional synonyms from working properly when users specify them in a custom configuration.

as per the example below, it's possible to index the term "hello" and then not be able to retrieve the document using the term "hello" 🤔

the solution to this problem is to split the synonyms into two buckets: one for unidirectional synonyms (a=>b syntax) and one for bidirectional synonyms (a,b syntax). we then need to apply both buckets at index-time and only the unidirectional synonyms at query-time.

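# start from a clean slate: delete the test index if it already exists
curl -s -XDELETE "http://localhost:9200/foo?pretty=true"

# create an index which applies "hello => world" at index-time only
# (the search_analyzer is plain "standard", so no synonyms at query-time)
curl -s -XPUT "http://localhost:9200/foo?pretty=true" \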
  -H 'Content-Type: application/json' \
  -d '{
      "settings" : {
        "analysis": {
          "filter" : {
            "mySynonym" : {
              "type" : "synonym",
              "synonyms" : [
                "hello => world"
              ]
            }
          },
          "analyzer": {
            "myAnalyzer": {
              "type": "custom",
              "tokenizer": "standard",
              "filter": [
                "mySynonym"
              ]
            }
          }
        }
      },
      "mappings" : {
        "_doc" : {
          "properties" : {
            "field1": {
              "type": "text",
              "analyzer": "myAnalyzer",
              "search_analyzer": "standard"
            }
          }
        }
      }
    }'

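# index a document containing "hello" (the synonym filter stores it as "world")
curl -s -XPOST "http://localhost:9200/foo/_doc/example?pretty=true" \
  -H 'Content-Type: application/json' \
  -d '{
      "field1": "hello"
    }'

# make the document visible to search
curl -s -XPOST "http://localhost:9200/foo/_refresh?pretty=true"

# search for "hello": no hits, because the query token "hello" is not
# rewritten to "world" at query-time
curl -s -XGET "http://localhost:9200/foo/_search?pretty=true" \
  -H 'Content-Type: application/json' \
  -d '{
      "query": {
        "match": {
          "field1": "hello"
        }
      }
    }'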
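
The two-bucket split described above could be sketched roughly as follows. This is only an illustration, not how Pelias implements it: the inline `synonyms` list is made up, and it assumes one rule per line with `=>` appearing only in unidirectional rules. In the example above, the query-time bucket would belong in a synonym filter attached to the `search_analyzer`, which currently is plain `standard`.

```shell
# Hypothetical custom synonyms, one rule per line (illustrative data only).
synonyms='hello => world
street, st
avenue, ave'

# Bucket 1: unidirectional rules (a => b), needed at index-time AND query-time.
unidirectional=$(printf '%s\n' "$synonyms" | grep -F '=>')

# Bucket 2: bidirectional rules (a,b), applied at index-time only.
bidirectional=$(printf '%s\n' "$synonyms" | grep -v -F '=>')

printf 'index-time rules:\n%s\n%s\n\n' "$unidirectional" "$bidirectional"
printf 'query-time rules:\n%s\n' "$unidirectional"
```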

missinglink avatar Dec 12 '19 20:12 missinglink

a workaround, for now, is to duplicate the token from the left side of the => on the right side as such:

hello => hello, world
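
For a longer custom synonyms list, the same rewrite could be applied mechanically. A rough sketch, where `keep_left_term` is a hypothetical helper (not part of Pelias) that assumes a single term on the left of the `=>` and rules which have not already been rewritten:

```shell
# Hypothetical helper: rewrite "a => b" as "a => a, b"; other lines pass through.
# Assumes one term on the left of "=>" and that the rule was not rewritten before
# (running it twice would duplicate the left-hand term again).
keep_left_term() {
  awk -F ' *=> *' 'NF == 2 { print $1 " => " $1 ", " $2; next } { print }'
}

# Reads rules on stdin, one per line, and writes the rewritten rules to stdout.
printf '%s\n' 'hello => world' 'street, st' | keep_left_term
# prints:
#   hello => hello, world
#   street, st
```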

missinglink avatar Dec 12 '19 20:12 missinglink

So we've now done this for the name field and the address_parts.street field in https://github.com/pelias/api/pull/1444. Are there other fields we should do the same for, or is this all done?

orangejulius avatar Jun 26 '20 17:06 orangejulius

This is only really relevant for custom user-defined synonyms and doesn't affect stock-standard Pelias.

So if a user added a synonym foo => bar in custom_name, for instance, then all instances of 'foo' would be replaced by 'bar' at index-time, yet at query-time there is no such replacement. This means the doc doesn't match a query that is verbatim the same as what was in the source data.

Let's leave this open for now so we remember. I'll try to fix it at some point, but it's a relatively low priority because it may not even affect anyone!

missinglink avatar Jun 26 '20 18:06 missinglink

One totally valid fix is just to say we don't support the => syntax at all, or that we warn anyone who uses it.

missinglink avatar Jun 26 '20 18:06 missinglink