autocomplete: use should instead of must for admin matching

Open Joxit opened this issue 4 years ago • 2 comments

Hi there,

Background

Currently, an autocomplete query that combines a subject and an admin part in a language other than English returns no results. The example I often use is Parijs, Frankrijk (Dutch for Paris, France).

First try

After the failure of https://github.com/pelias/whosonfirst/pull/492, I looked for an alternative way to get a result for Parijs, Frankrijk in Dutch.

In ES, pelias documents contain this information:

  • item name in many languages (e.g. name.fr = [Paris, Ville-Lumière], name.nl = Parijs...)
  • item admin hierarchy (e.g. parent.country = France, parent.macroregion = Île-De-France...)

The idea here was to use two ES queries for one autocomplete with this workflow:

  1. Parse the input with pelias-parser
  2. If the admin part is not empty
    1. Create an ES query, restricted to coarse layers, that returns all documents where name = Frankrijk or name.nl = Frankrijk
    2. Generate a should clause where parent.*_id = id
  3. Use the current autocomplete query and add the new should clause.
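The workflow above could be sketched roughly as follows. This is a minimal illustration in Python, not the code from the attempt itself: the helper names, the simplified query bodies, the shape of the hits, and the list of coarse layers are all assumptions. Only the parent.<layer>_id pattern comes from step 2 of the list.

```python
# Hypothetical sketch of the two-query workflow. Helper names, query shapes,
# and the layer list are assumptions made for illustration.

COARSE_LAYERS = ["country", "macroregion", "region", "locality"]  # assumed subset


def build_admin_lookup_query(admin_text, lang):
    """Step 2.1: find coarse documents whose name matches the admin text."""
    return {
        "query": {
            "bool": {
                "must": [{
                    "multi_match": {
                        "query": admin_text,
                        # default name plus the name in the request language
                        "fields": ["name.default", f"name.{lang}"],
                    }
                }],
                "filter": [{"terms": {"layer": COARSE_LAYERS}}],
            }
        }
    }


def build_parent_should_clauses(hits):
    """Step 2.2: one clause per matched admin document (parent.<layer>_id = id)."""
    return [
        {"term": {f"parent.{hit['layer']}_id": hit["id"]}}
        for hit in hits
    ]


def add_should_clauses(autocomplete_query, clauses):
    """Step 3: append the new clauses to the existing autocomplete query."""
    bool_query = autocomplete_query["query"]["bool"]
    bool_query.setdefault("should", []).extend(clauses)
    return autocomplete_query
```

The second helper expects the hits of the first query reduced to their layer and id, for example `[{"layer": "country", "id": "123"}]` (a made-up id). The extra round trip to Elasticsearch is the cost of this approach.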

This caused side effects (-3% in acceptance tests...), so I abandoned this solution... :disappointed:

Second try

I moved the scoring of admin components from the must clause to the should clause. This probably won't solve all issues, but it's a good start.
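To illustrate the difference, here is a simplified sketch of the two query shapes. It is not the actual Pelias autocomplete query, which has many more clauses. The function name and the admin field list are assumptions.

```python
# Simplified sketch: the real autocomplete query is more involved.
ADMIN_FIELDS = ["parent.country", "parent.region", "parent.locality"]  # assumed subset


def build_query(subject, admin, admin_as_should):
    """Build a bool query with the admin clause under either must or should."""
    name_clause = {"match": {"name.default": {"query": subject}}}
    admin_clause = {"multi_match": {"query": admin, "fields": ADMIN_FIELDS}}
    bool_query = {"must": [name_clause], "should": []}
    # Under must, documents without an admin match are excluded.
    # Under should (next to a must clause), an admin match only adds to the score.
    bool_query["should" if admin_as_should else "must"].append(admin_clause)
    return {"query": {"bool": bool_query}}


before = build_query("Parijs", "Frankrijk", admin_as_should=False)
after = build_query("Parijs", "Frankrijk", admin_as_should=True)
```

With `before`, a document must match both the name and the admin text. With `after`, a name match is enough, and an admin match only influences ranking.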

It works just fine with Parijs, Frankrijk and improved some queries:

275,276c275,290
<   ✘ [1] "/v1/autocomplete?text=412 Saint Patrick St, donaldsonville, la": no results returned
<   ✘ [2] "/v1/autocomplete?text=412 St Patrick St, donaldsonville, la": no results returned
---
>   ✘ [1] "/v1/autocomplete?text=412 Saint Patrick St, donaldsonville, la": score 3 out of 5
>   diff:
>     street
>       expected: Saint Patrick Street
>       actual:   Patrick St
>     locality
>       expected: Donaldsonville
>       actual:   Minden
>   ✘ [2] "/v1/autocomplete?text=412 St Patrick St, donaldsonville, la": score 3 out of 5
>   diff:
>     street
>       expected: Saint Patrick Street
>       actual:   Patrick St
>     locality
>       expected: Donaldsonville
>       actual:   Minden
386c400,402
<   ✘ [7-2] "/v1/autocomplete?lang=ru&text=8 Марта, Белоруссия": no results returned
---
>   ✘ [7-2] "/v1/autocomplete?lang=ru&text=8 Марта, Белоруссия": score 3 out of 4
>   diff:
>     priorityThresh is 1 but found at position 4
407c423,431
<   ✘ [5] "/v1/autocomplete?sources=osm&layers=street&text=dionysiou areopagitou, athens": no results returned
---
>   ✘ [5] "/v1/autocomplete?sources=osm&layers=street&text=dionysiou areopagitou, athens": score 3 out of 6
>   diff:
>     name
>       expected: Dionysiou Areopagitou
>       actual:   Dionysiou
>     street
>       expected: Dionysiou Areopagitou
>       actual:   Dionysiou
>     'Dionysiou, Greece' is not close enough: distance is 203847m but should be under 500m

Results for Parijs, Frankrijk. Since this is autocomplete, I don't think it's strange to have results that don't match the name 100% :thinking:

Parijs, Frankrijk
Parijs, Amsterdam, Nederland
Stad Parijs, Hulten, Nederland
Stadspalazzo Parijs, Apeldoorn, Nederland
Parijs, ON, Canada
Parijs, Zuid-Afrika
Parijs, België
Parijs, Zambia
Parijs, Lochristi, België
Goed te Parijs, Deinze, België

related #1296

Joxit, Jan 27 '21 15:01

Ah yes, this is a very tricky problem to solve.

First, I definitely agree that running multiple Elasticsearch queries in the Autocomplete endpoint is not a good idea. It works for the search endpoint because response time is not as much of an issue, but for autocomplete we should look elsewhere for solutions.

I also think you'll find that this PR is nearly equivalent to removing the two admin multi-match queries entirely, at least for examples like the one you gave. There are no documents with Frankrijk in the admin fields at all, as far as I know. That means you might as well be querying for Parijs, NotARealPlaceInTheWorld.

The other downside is that should queries, in general, lead to slower response times from Elasticsearch, as the number of documents that match can increase significantly. We did quite a bit of work a few years back to cut down on queries that were matching hundreds of millions of documents and destroying performance, so it would be a shame to move in the wrong direction.

Next steps

I actually think the steps we need to take to make queries like this work are fairly drastic, but still worth undertaking. Don't quote me on this being 100% the way forward (I've only just had my morning 🍵 ), but I imagine we will want to:

  • Support language filtering at import time (https://github.com/pelias/pelias/issues/867). There are just too many languages out there and realistically most people only care about a few. Additionally, different Pelias users might care about very different languages (consider Geocode Earth based in North America, vs Jawg in Europe vs another in Asia). We might want to cut out some of the "junk" WOF names as well.
  • Start using a "combined" name/admin field for autocomplete. Autocomplete really cares about field length and term positions for accurate matches and scoring, so what we really need (based on our explorations in https://github.com/pelias/pelias/issues/862) is a single field whose contents are in a single language, like Paris, France or Parijs, Frankrijk, but never 巴黎, Francia (Chinese mixed with Italian). I think this means having many more documents in Elasticsearch (several per record in WOF), each a bit smaller.
  • Then we can re-attempt something like https://github.com/pelias/whosonfirst/pull/492. While adding all WOF names is probably not worth it, adding a smaller subset might be.

orangejulius, Jan 27 '21 15:01

Good morning and thank you for your feedback :stuck_out_tongue:

Hmm... :thinking: Yes, we can create many documents, or create a new field named label?

This field will contain the full label per language (e.g. label.nl = Parijs, Frankrijk; label.fr = Paris, France), and autocomplete will search in both name.xx (subject only, for boost) and label.xx.
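As a rough sketch of how such per-language labels could be built (the function and the input shapes are hypothetical, and a real implementation would need translations for the whole admin hierarchy):

```python
def build_labels(names, admin_names):
    """Build one label per language, only when both parts exist in that language."""
    labels = {}
    for lang, name in names.items():
        admin = admin_names.get(lang)
        if admin is not None:
            # Never mix languages inside one label.
            labels[lang] = f"{name}, {admin}"
    return labels


names = {"nl": "Parijs", "fr": "Paris", "zh": "巴黎"}
admin_names = {"nl": "Frankrijk", "fr": "France"}
labels = build_labels(names, admin_names)
# labels == {"nl": "Parijs, Frankrijk", "fr": "Paris, France"}
```

Here no label is produced for zh because the admin name is missing in that language, so a mixed-language label can never be indexed.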

This will avoid document growth and ID rework. To have something like this, pelias/spatial will be mandatory (as the endpoint with all translations) and we will need pelias-labels :thinking:

Aliases will still be a problem though.

I will work on pelias/pelias#867 tomorrow :+1:

Joxit, Jan 27 '21 17:01