
Compute an ngram field for all street names

Open missinglink opened this issue 5 years ago • 3 comments

This PR adds one new field called address_parts.street.ngram

It takes advantage of the fields mapping to generate ngrams for street names.

The ngram field is analyzed using a new analyzer called peliasIndexStreetOneEdgeGram, which is mostly the same as the peliasIndexOneEdgeGram analyzer except that it uses different synonym files. It produces prefix ngrams which can be used for autocomplete.
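To make the mechanism concrete, here is a minimal sketch of what a multi-field mapping of this shape could look like, plus a toy function showing which prefixes an edge ngram filter emits for a single token. It is an illustration under assumptions, not the contents of this PR: the `text` field type depends on the Elasticsearch version in use, the `min_gram`/`max_gram` values are made up, and the helper names `street_ngram_mapping` and `edge_ngrams` are hypothetical.

```python
# Sketch only: field types and gram sizes below are assumptions,
# not the values configured in this PR.

def street_ngram_mapping():
    """Return a mapping fragment where `street` has an `ngram` sub-field."""
    return {
        "address_parts": {
            "properties": {
                "street": {
                    "type": "text",  # assumed; older Elasticsearch versions use "string"
                    "fields": {
                        # Addressable at query time as address_parts.street.ngram
                        "ngram": {
                            "type": "text",
                            "analyzer": "peliasIndexStreetOneEdgeGram",
                        }
                    },
                }
            }
        }
    }


def edge_ngrams(token, min_gram=1, max_gram=24):
    """Return the prefixes of `token`, shortest first, like an edge ngram filter."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]
```

For example, `edge_ngrams("main")` yields the prefixes `m`, `ma`, `mai` and `main`, which is why a partially typed street name can match the indexed field.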

The motivation here is to improve, simply and efficiently, autocomplete queries which contain street names. We currently do autocomplete on the name.default field, which mixes names and addresses. Using this new field with a multi_match query will have benefits over the single-field approach:

  • Allow per-field analysis including synonym substitutions which are specific to streets
  • Allow for the fields to be included or excluded at query-time using a multi_match query
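As a rough illustration of the second point, the sketch below builds a multi_match query body whose field list can include or exclude the street ngram field at query time. The helper name `autocomplete_query` and the `include_street` flag are hypothetical, and the real Pelias queries are considerably more involved; `best_fields` is simply the Elasticsearch default type for multi_match, written out explicitly.

```python
# Sketch only: not the actual Pelias autocomplete query.

def autocomplete_query(text, include_street=True):
    """Build a multi_match query body over name.default and, optionally, the street ngram field."""
    fields = ["name.default"]
    if include_street:
        fields.append("address_parts.street.ngram")
    return {
        "query": {
            "multi_match": {
                "query": text,
                "fields": fields,
                "type": "best_fields",  # Elasticsearch default, made explicit here
            }
        }
    }
```

Calling `autocomplete_query("main st", include_street=False)` falls back to the current single-field behaviour, so the same code path can serve both old and new indices.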

I suspect the changes required for the queries will be minimal.

It's likely that the index size will increase after merging this PR because the street names will be indexed using an edge ngram filter twice, once for name.default and once for street.ngram.

The plan, following this PR, is to remove the street names from the name.default field. Once this has been done, the index should return to its previous on-disk size (or very close to it).

The analysis that I've configured in this PR is likely not perfect, but it mirrors what we already have, so the integration will be easier. Once we've merged this PR and switched the queries, we will be much freer to improve individual fields and analysis.

related: https://github.com/pelias/schema/pull/347
related: https://github.com/pelias/schema/pull/359

missinglink avatar May 20 '19 10:05 missinglink

This is the plan to roll this out without breaking backwards compatibility:

  1. Update schema to add street.ngram field
  2. Update queries to use multi_match
  3. Queries should now work with street data either in name.default or street.ngram (backwards and forwards compatible)
  4. Check the codebase to ensure that Documents with no name are considered valid
  5. Remove the osm and oa code which concatenates the housenumber and street name together.
  6. Remove name.default from multi_match where applicable (remove backwards compatibility)

missinglink avatar May 20 '19 11:05 missinglink

This is a really cool feature, but I didn't realise how much work would be involved in changing the way name.default works. In particular, there would need to be changes to how search works and also to how labels are generated.

Let's leave this open for discussion. I'd still very much like to merge this one day because it's a big step forward for modernizing our schema design based on learnings over the last few years.

missinglink avatar Jun 03 '19 11:06 missinglink

The increase in disk space for the whole planet was <20GB

missinglink avatar Jun 03 '19 11:06 missinglink