improved boosting for exact matches where the last token is in fact complete

Open missinglink opened this issue 5 years ago • 13 comments

Today I was looking into the scoring of autocomplete queries for transit stops and managed to fix incorrect scoring in some cases:

before:

/v1/autocomplete?text=stop 2

 1)	SW 2nd & Alder (TriMet Stop ID 7095), Portland, OR, USA
 2)	W Burnside & SW 2nd (TriMet Stop ID 9526), Portland, OR, USA
 3)	SW 2nd & Taylor (TriMet Stop ID 7098), Portland, OR, USA
...

/v1/autocomplete?text=2905

 1)	2905 SW 1st Ave, Portland, OR, USA
 2)	2905 SW 4th Ave, Portland, OR, USA
 3)	2905 SE 17th Ave, Portland, OR, USA

after:

/v1/autocomplete?text=stop 2

 1)	A Ave & Chandler (TriMet Stop ID 2), Lake Oswego, OR, USA
 2)	W Burnside & SW 2nd (TriMet Stop ID 9526), Portland, OR, USA
 3)	NW Everett & 2nd (TriMet Stop ID 1612), Portland, OR, USA

/v1/autocomplete?text=2905

 1)	SE Hwy 212 & 122nd (TriMet Stop ID 2905), Clackamas County, OR, USA
 2)	2905 SW 1st Ave, Portland, OR, USA
 3)	2905 SW 4th Ave, Portland, OR, USA

So.. explaining this one is a bit complex but I'll give it a shot!

For autocomplete we split the input text into 'tokens', and then mark each token as complete or incomplete. The gist is that every token except the last one is 'complete'; for the last one, we don't know whether the user has finished typing that word or not.

An example query:

input: "stop 2"
tokens: [ "stop", "2" ]
complete: [ "stop" ]
incomplete: [ "2" ]

Before this PR we would already boost documents matching "stop" in the phrase index. This is important because it boosted exactly matching phrases higher and didn't use the ngram index.

So this prevented "Clayton Avenue" and "Clay Ave" from scoring the same despite sharing common ngrams.

This works great when the first token is complete and the last one isn't.

This PR now covers the additional case where the last token is in fact also complete (something we cannot tell just by looking at the text).

So in the example above we will generate one more subquery which also tries to match "stop 2"; in this case it would find an exact match, giving that document an additional boost.
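
To make that concrete, here is a minimal sketch of the query shape, assuming a match_phrase clause against the phrase.default field discussed later in this thread. This is illustrative only, not the literal query the API builds:

POST /pelias/_search

{
   "query": {
      "bool": {
         "must": [
            { "match": { "name.default": "2" } },
            { "match_phrase": { "phrase.default": "stop" } }
         ],
         "should": [
            { "match_phrase": { "phrase.default": "stop 2" } }
         ]
      }
   }
}

The two must clauses stand in for the existing ngram and phrase subqueries; the new should clause rewards documents whose phrase index contains the complete input without penalising those that don't.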

missinglink avatar Oct 09 '18 13:10 missinglink

I also looked into getting numerically ascending scoring in this case, such as:

?text=stop 2

Stop 2
Stop 20
Stop 21
...

... however, this is not currently possible due to how the peliasIndexOneEdgeGram analyzer treats numbers as a single token:

GET /pelias/_analyze?analyzer=peliasIndexOneEdgeGram&text=222

{
   "tokens": [
      {
         "token": "222",
         "start_offset": 0,
         "end_offset": 3,
         "type": "word",
         "position": 0
      }
   ]
}

This means we can't currently do prefix matching on numbers. I suspect we did this because storing prefix ngrams for numbers would get pretty messy pretty fast :)
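
For contrast, here is what I'd expect the same analyzer to do with an alphabetic token. The output below is sketched by hand, assuming edge grams starting at the min_gram of 2 mentioned later in this thread, so the exact metadata may differ:

GET /pelias/_analyze?analyzer=peliasIndexOneEdgeGram&text=stop

{
   "tokens": [
      {
         "token": "st",
         "start_offset": 0,
         "end_offset": 4,
         "type": "word",
         "position": 0
      },
      {
         "token": "sto",
         "start_offset": 0,
         "end_offset": 4,
         "type": "word",
         "position": 0
      },
      {
         "token": "stop",
         "start_offset": 0,
         "end_offset": 4,
         "type": "word",
         "position": 0
      }
   ]
}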

missinglink avatar Oct 09 '18 13:10 missinglink

cc/ @fpurcell

missinglink avatar Oct 09 '18 13:10 missinglink

Well, I can confirm this does exactly what it says on the tin. It boosts exact matches for autocomplete. Quite a bit. This does actually negatively affect the autocomplete acceptance test scores.

                                     master   autocomplete_boost_exact_matches   Change
Pass                                   3551                               3491   -1.69%
Fail                                   1683                               1685    0.12%
Regressions                            4556                               4616    1.32%
Percent                               53.46                              52.86   -1.12%
Search passes, autocomplete fails       233                                213   -8.58%
Jitter                                   73                                105   43.84%
Pass only on last char                   23                                 33   43.48%

Overall though, the code looks great, and I believe the principle is sound. Perhaps we should look at lowering the boost for this query clause, so it's a little less powerful.
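
For example, if the new clause looks anything like the sketch below, its boost parameter would be the natural knob to turn (the field name and value here are illustrative, not taken from the PR):

{
   "match_phrase": {
      "phrase.default": {
         "query": "stop 2",
         "boost": 0.5
      }
   }
}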

Notable new failures

Most of the failures seem to be while autocompleting admin area names

Pari
[screenshot: autocomplete results for "Pari"]

brook
[screenshot: autocomplete results for "brook"]

Notable improvements

/v1/autocomplete?text=100 20th
[screenshot of results]
This used to return streets that didn't match exactly, but now it returns exact matches.

orangejulius avatar Oct 16 '18 01:10 orangejulius

I'm still interested in merging this; the 'Failures' represent a change, but I'm not convinced it's a negative one.

With our current implementation, it's impossible to find some of those records, for example:

/v1/autocomplete?text=pari, italy

 1) Métropole du Grand Paris, Villejuif, France
 2) Paris, France
 3) Paris, France
 4) Paris, France
 5) Paris, France
 6) East Baton Rouge Parish, Central, LA, USA
 7) East Baton Rouge Parish, LA, USA
 8) Jefferson Parish, LA, USA
 9) Jefferson Parish, Lafitte, LA, USA
10) Pariaman, Indonesia

So this means that there are small towns with exact linguistic matches being drowned out by larger cities to the extent that they are no longer discoverable via autocomplete.

This change would allow those exact matches to be discovered, while still allowing the user to continue specifying more of the input until they receive the required result.

[edit] in fact the query /v1/autocomplete?text=pari, estonia returns exactly the same results as the query above.

missinglink avatar Oct 25 '18 18:10 missinglink

This also fixes a use case for one of my clients. Here is his use case:

/v1/autocomplete?text=Schirrhein&sources=wof,oa,osm

 1) 9 Rue De Schirrhein, Soufflenheim, France
 2) 6 Rue De Schirrhein, Soufflenheim, France
 3) 18 Rue De Schirrhein, Soufflenheim, France
 4) 33 Rue De Schirrhein, Soufflenheim, France
 5) 10 Rue De Schirrhein, Soufflenheim, France
 6) 28 Rue De Schirrhein, Soufflenheim, France
 7) 11 Rue De Schirrhein, Soufflenheim, France
 8) 25 Rue De Schirrhein, Soufflenheim, France
 9) 1 Rue De Schirrhein, Soufflenheim, France
10) 24 Rue De Schirrhein, Soufflenheim, France

As a result he only gets full addresses, even though he is searching for the city. With this PR, here is what he will get:

/v1/autocomplete?text=Schirrhein&sources=wof,oa,osm

 1) Schirrhein, France
 2) Schirrhein, France
 3) Schirrhein, France
 4) FC Schirrhein, Schirrhein, France
 5) 9 Rue De Schirrhein, Soufflenheim, France
 6) 6 Rue De Schirrhein, Soufflenheim, France
 7) 18 Rue De Schirrhein, Soufflenheim, France
 8) 33 Rue De Schirrhein, Soufflenheim, France
 9) 10 Rue De Schirrhein, Soufflenheim, France
10) 28 Rue De Schirrhein, Soufflenheim, France

Joxit avatar Oct 26 '18 10:10 Joxit

I'm coming around to this PR.

My main objection at this point is that it adds another potential query clause to our already complex autocomplete queries (having dug through them in the last few days, I can say the number of possible query layouts is actually quite staggering!).

I still think the boosting is a little too strong, but it's clear we are currently under-scoring exactly matching results, especially in the single-token input case.

I'm open to ideas for tuning it a bit, but if none are immediately obvious, we can look at simplifying our autocomplete queries separately; it seems like the good already far outweighs the bad here!

orangejulius avatar Oct 29 '18 12:10 orangejulius

The Brooklyn results concern me as well; it looks like our queries would now be predisposed to match fairly low-importance records that have multiple words and don't even start with the word from the input text. Any thoughts on how we could fix this? That would be a huge improvement.

orangejulius avatar Oct 29 '18 12:10 orangejulius

I wrote up a wiki article explaining the pros/cons of different autocomplete query styles https://github.com/pelias/api/wiki/Autocomplete-search-logic

missinglink avatar Nov 09 '18 10:11 missinglink

@orangejulius I had a look at the autocomplete queries today following on from writing the wiki article linked above ^

I think it's appropriate to generate 4 different queries for the hybrid approach, depending on the number of tokens provided in the input and their state of completion.

single token inputs

For a single-token input there are two options:

  • The token is complete (it was output by a parser)
  • The token is incomplete (it was not able to be parsed)

In the first situation (which is fairly rare), we produce one gram per token using the peliasQueryFullToken analyser and compare it to the phrase.default index; only one query clause is generated.

In the second situation we also produce one gram per token, using the peliasQueryPartialToken analyser, but this time we query the name.default index; again, only one query clause is generated.

This PR would add a second clause in this situation which additionally queries the phrase.default index, effectively boosting exact matches, so it would generate two clauses.
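
Sketched as Elasticsearch DSL, the single-token incomplete case with this PR might then look something like the following, reusing the "pari" example from above (an illustrative structure, not the exact query the API emits):

{
   "query": {
      "bool": {
         "must": [
            { "match": { "name.default": { "query": "pari", "analyzer": "peliasQueryPartialToken" } } }
         ],
         "should": [
            { "match_phrase": { "phrase.default": "pari" } }
         ]
      }
   }
}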

note: the difference between peliasQueryFullToken and peliasQueryPartialToken is simply how confident each can be about applying synonym substitution

multiple token inputs

For a multi-token input there are also two options:

  • The last token is complete (it was output by a parser)
  • The last token is incomplete (it was not able to be parsed)

I won't repeat myself, but basically the same thing happens for the last token as explained above. In both cases we generate a query against phrase.default for the prior tokens (all but the last one).

So it generates two clauses for each of these options, and would produce a third clause with this PR.

thoughts / conclusions

I think the "hybrid" approach is superious to both the "exact" and "loose" methods, it looks impossible to avoid two clauses for a multi-token input using "hybrid"

We might be able to reduce complexity by dropping support for the "rare cases" where all tokens have been classified as complete, or at least falling back to normal search at that point. This would leave only two possible query permutations.

I'm not 100% sure on the reasoning behind the constant_score clause, this will need more investigation.

I still think adding a third clause will 'tip the balance' of "hybrid" more towards "strict" than "loose", but I feel that, while it's a change, it's a marked improvement in user experience.

We should spend a little more time tuning this balance between "strict" and "loose" to a point we feel is appropriate; I agree it might be a little too far in the direction of "strict" right now.

missinglink avatar Nov 09 '18 11:11 missinglink

Also worth noting that our min_ngram size is currently set at 2, so if a clause were to be generated against the name.default index testing a single-character token, that clause would be discarded.

We might be able to get rid of this too, but we'd need to check the queries to ensure we're not doing something like a MUST clause on it, which would return zero results.
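
To illustrate why such a clause could never match: assuming the edge-gram filter emits nothing shorter than min_gram, a single-character token produces no grams at all (output sketched by hand):

GET /pelias/_analyze?analyzer=peliasIndexOneEdgeGram&text=a

{
   "tokens": []
}

If that's accurate, a MUST clause testing name.default for "a" would match zero documents, which is presumably why the clause is discarded instead.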

missinglink avatar Nov 09 '18 12:11 missinglink

I just wrote some unit tests to cover all the different autocomplete query permutations in https://github.com/pelias/api/pull/1240

missinglink avatar Nov 09 '18 13:11 missinglink

I took a look at modifying this PR to lower the boost for exact matches a little. It didn't seem to have any effect, but it might be worth a little more experimentation.

orangejulius avatar Jan 28 '19 17:01 orangejulius

This PR is so old now it's unlikely it can ever be merged, mainly because it uses the addressit parser which has long since been retired.

It's a bit of a pity though; I was investigating a report similar to the issues described here, and I wonder if we can update the code and still get the benefits shown above with the more modern parser.

missinglink avatar Aug 09 '21 11:08 missinglink