schema
schema copied to clipboard
Compute an ngram field for all street names
This PR adds one new field called address_parts.street.ngram
It takes advantage of the fields
mapping to generate ngrams for street names.
The ngram field is tokenized using a new tokenizers called peliasIndexStreetOneEdgeGram
which is mostly the same as the peliasIndexOneEdgeGram
analyzer except using different synonyms files, so as to produce prefix-ngrams which can be used for autocomplete.
The motivation here is to be able to, quite simply and efficiently, improve autocomplete queries which contain street names.
We currently do autocomplete on the name.default
field, which mixes names and addresses, using this field with a multi_match
will have benefits over the single-field approach:
- Allow per-field analysis including synonym substitutions which are specific to streets
- Allow for the fields to be included or excluded at query-time using a
multi_match
query
I suspect the changes required for the queries will be minimal.
Its likely that the index size will be increased after merging this PR because the street names will be indexed using an edge ngram filter twice, once for name.default
and once for street.ngram
.
The plan is, following this PR to remove the street names from the name.default
field, once this has been done the index will return to the previous size (or a very similar on-disk size).
The analysis that I've configured in this PR is likely not perfect, but it mirrors what we already have, so the integration will be easier. Once we've merge this PR and switched the queries we will be much freer to improve individual fields and analysis.
related: https://github.com/pelias/schema/pull/347 related: https://github.com/pelias/schema/pull/359
This is the plan to roll this out without breaking backwards compatibility:
- Update schema to add
street.ngram
field - Update queries to use
multi_match
- Queries should now work with street data either in
name.default
orstreet.ngram
(backwards and forwards compatible) - Check the codebase to ensure that Documents with no name are considered valid
- Remove osm and oa code which concats the housenumber and street name together.
- Remove
name.default
frommulti_match
where applicable (remove backwards compatibility)
This is a really cool feature but I didn't realise how much work would be involved in order to change the way name.default
works.
In particular, there would need to be changed to how search works and also how labels are generated.
Let's leave this open for discussion, I'd still very much like to merge this one day because it's a big step forward for modernizing our schema design based on learnings over the last few years.
The increase in disk space for the whole planet was <20GB