Use match instead of match_phrase query for autocomplete
This is a small but meaningful change I've been meaning to make to our autocomplete queries for a while.
The primary must query in many of our autocomplete queries is a match_phrase query. This ends up being a little too strict, since match_phrase requires that every token be present and in order.
This PR replaces that query with a match query, and uses minimum_should_match to allow some tokens to be missing when there are 3 or more tokens in the input. The original match_phrase query is moved to the should clause to provide an additional scoring boost when all tokens match and are in the correct order.
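To make the shape of the change concrete, here's a rough sketch of the new query structure (the field name and the exact minimum_should_match value are illustrative placeholders, not necessarily what our query builder emits):

```js
// Sketch only: illustrative field name and minimum_should_match value.
// The real query is built by our query builder; this just shows the shape.
function buildAutocompleteQuery(text) {
  return {
    query: {
      bool: {
        must: [{
          // loose match: tokens can match in any order, and once there are
          // 3 or more tokens, one of them is allowed to be missing
          match: {
            'name.default': {
              query: text,
              minimum_should_match: '2<-1'
            }
          }
        }],
        should: [{
          // strict phrase match: extra scoring boost when all tokens are
          // present and in the correct order
          match_phrase: {
            'name.default': {
              query: text
            }
          }
        }]
      }
    }
  };
}

console.log(JSON.stringify(buildAutocompleteQuery('3929 saint marks avenue'), null, 2));
```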
The main effect of this PR is increasing the number of queries that will return results.
For example: /v1/autocomplete?text=3929 saint Marks Avenue, Niagara Falls, ON, Canada.

Because we only contract abbreviations when indexing, this query (which includes the word saint when looking for an address that uses the abbreviation st in our data) was finding no results. Now it returns the correct result, and several variations with typos or incorrect words do as well.
Some examples that now all return the correct result at least in the top 10:
/v1/autocomplete?text=3929 st mark Avenue, Niagara Falls, ON, Canada
/v1/autocomplete?text=3929 st foo Avenue, Niagara Falls, ON, Canada
/v1/autocomplete?focus.point.lat=43.103698780971875&focus.point.lon=-79.02009080158956&text=3929 st Mark Avenue
While I didn't see any major regressions comparing our autocomplete test suites for this PR, we should do another quick round of testing before merging this.
This sort of change is a prerequisite for substantial work on any sort of fuzzy matching or typo correction, since the Elasticsearch fuzziness option is only available on a match query.
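For reference, fuzziness hangs off the match query itself (match_phrase doesn't support it), so the loose must clause introduced here is where any future typo tolerance would live. A hedged sketch, with an illustrative field name and a fuzziness value that is not part of this PR:

```js
// Illustrative only: fuzziness is a parameter of the match query and is not
// supported by match_phrase. Nothing in this PR enables it yet.
const fuzzyLooseMatch = {
  match: {
    'name.default': {
      query: '3929 saint marks avenue',
      minimum_should_match: '2<-1', // as in the sketch above, illustrative
      fuzziness: 'AUTO'             // hypothetical future setting
    }
  }
};
```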
Here's an example of a downside of this PR:
when querying for 85 GREY mountain drive, sedona, az (note: input is grey, street name is gray), we see several not-quite-matching results:

As shown here, some of the incorrect matches even score higher than the desired result.
One way we might be able to fix this is some post-processing in the API after queries return, attempting to prune out clearly non-matching records. This could actually be useful in lots of different cases.
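As a rough illustration of what that post-processing might look like (the tokenization, the use of the result label, and the overlap threshold are all hypothetical, not an existing Pelias component):

```js
// Hypothetical post-query filter: drop results whose label shares too few
// tokens with the input text. Tokenizer and threshold are placeholders.
function pruneNonMatching(inputText, results, minOverlap = 0.5) {
  const tokenize = (s) => s.toLowerCase().split(/[\s,]+/).filter(Boolean);
  const inputTokens = new Set(tokenize(inputText));

  return results.filter((result) => {
    const labelTokens = new Set(tokenize(result.properties.label || ''));
    const matched = [...inputTokens].filter((t) => labelTokens.has(t)).length;
    return matched / inputTokens.size >= minOverlap;
  });
}

// e.g. pruneNonMatching('85 grey mountain drive sedona az', response.features)
```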
I just realized I actually had a misspelling in my example above, which happened to show off both some positives and some negatives of this PR. It allowed us to return the correct result even with a slight difference between the input and the address record (grey vs gray).
Spelling the street name as it is in the data, we see the more common way this would likely manifest: lots of non-matching house numbers, and a non-matching street coming in after the expected result:

I noticed that queries for ordinal streets (such as 1st st) can become noisier with this PR:
The last 3 results are undesirable because they aren't what the user requested:
i.e. 5 E 10th St != 5 E 5th St
Also, 5 East Street is not correct for this input either.
I continue to do bits of analysis on the impact of this PR. Most queries are noisier with this change, which is expected. We'll want to do some analysis of the performance impact of the increased number of documents that will match and have to be scored. It might be minor, or there might be some new cases that are problematic.
Another minor but interesting example of changes from this PR: it effectively increases the relative weight of text matching in scoring, compared to popularity or population.
Here's an example of an autocomplete query for Kansas City. Before, the more popular (and populated) city of Kansas City, MO came first. As it happens, due to a bug, the Kansas City, KS record had a slight boost to the text match. The boost was amplified by this PR enough to swap the results.

While this particular issue will be fixed, we should keep in mind that by adding another scoring query based on text match, we are essentially "devaluing" the focus point/population/popularity component of scoring. Over in https://github.com/pelias/api/issues/1205 we discuss the idea of moving from an "additive" scoring model (text + focus point + population + popularity) to a "multiplicative" scoring model (text * focus point * population * popularity), which might help avoid this problem someday.
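For illustration, the additive vs multiplicative distinction roughly corresponds to how an Elasticsearch function_score query combines the text score with its boost functions via boost_mode (the fields and modifiers below are placeholders, not our actual scoring views):

```js
// Illustrative only: 'sum' roughly corresponds to the additive model we use
// today, 'multiply' to the multiplicative model discussed in pelias/api#1205.
const scoredQuery = (boostMode) => ({
  function_score: {
    query: { match: { 'name.default': 'kansas city' } },
    functions: [
      { field_value_factor: { field: 'population', modifier: 'log1p', missing: 1 } },
      { field_value_factor: { field: 'popularity', modifier: 'log1p', missing: 1 } }
    ],
    score_mode: 'sum',    // how the boost functions combine with each other
    boost_mode: boostMode // how the combined boosts combine with the text score
  }
});

const additive = scoredQuery('sum');
const multiplicative = scoredQuery('multiply');
```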
Just for posterity, here's the change in scoring that resulted in the behavior I saw. Before, the population component accounted for 55% of the highest ranking document's score, whereas after, it's only 39%.

This PR has already somehow grown quite old!
Since then we've had the incredible #1296 come in to start using multi-match queries for autocomplete to improve support for multiple languages. I think this PR will require some reworking to support that.
What I'd actually like to do is take the pattern explored in this PR (where there is a loose match query in a must, and a strict match_phrase query in a should), and create a view for it in pelias/query, so that we can easily reuse it throughout Pelias. We might want both a match and a multi_match variant.
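A rough sketch of what that view might look like, following the pelias/query convention of a function that takes a variable store and returns a query clause (the variable names and field are placeholders):

```js
// Hypothetical pelias/query view: loose `match` in the must position plus a
// strict `match_phrase` boost in should. Variable names are placeholders.
function looseMatchWithPhraseBoost(vs) {
  if (!vs.isset('input:name')) {
    return null;
  }

  const name = vs.var('input:name').get();

  return {
    bool: {
      must: [{
        match: {
          'name.default': {
            query: name,
            minimum_should_match: vs.var('name:min_should_match').get()
          }
        }
      }],
      should: [{
        match_phrase: {
          'name.default': { query: name }
        }
      }]
    }
  };
}

module.exports = looseMatchWithPhraseBoost;
```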
This change overall seems negative in my testing. It seems to REALLY want to make things into intersections.
Some examples:
/v1/autocomplete?lang=&size=10&focus.point.lat=37.78657&focus.point.lon=-122.42001999999998&text=Smoke Vape & Beyond --> goes from perfect venue match to intersection
/v1/autocomplete?lang=&size=10&focus.point.lat=37.85597&focus.point.lon=-122.28915&text=Tenth Street & Grayson Street --> goes from perfect intersection match to "label": "East Grayson Street & Grayson Street, Mexia, TX, USA"
/v1/autocomplete?lang=&size=10&focus.point.lat=37.76429&focus.point.lon=-122.46603&text=9th+%26+Irving --> was a perfect match, now a different intersection match 130km away from the focus point
/v1/autocomplete?lang=&size=10&focus.point.lat=37.77306&focus.point.lon=-122.4734&text=15th Avenue & Fulton Street --> was a perfect match, now matches out-of-order on the street suffixes, returning "15th Street & Fulton Street, Tell City, IN, USA" many km away
Out of 500 venue & intersection queries I saw ~60 diffs, with no wins in the first 20