idea: subfield for `address_parts.number` with alpha tokens
The `peliasHousenumber` analyzer strips non-numeric tokens.
As discussed in https://github.com/pelias/pelias/issues/810 this is somewhat unintuitive but actually works very well.
https://github.com/pelias/schema/blob/41bd2d1daa0e924a7f73850202640ee7c3a1ad45/settings.js#L124-L128
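For illustration, the behaviour can be checked with the `_analyze` API; a quick sketch using the legacy elasticsearch JS client (index name and exact output are illustrative):

```js
// sketch: inspect how peliasHousenumber tokenizes an alpha-suffixed housenumber
const elasticsearch = require('elasticsearch');
const client = new elasticsearch.Client({ host: 'localhost:9200' });

client.indices.analyze({
  index: 'pelias',
  body: { analyzer: 'peliasHousenumber', text: '38a' }
}).then((res) => {
  // expect something like ['38'] -- the alpha suffix is stripped
  console.log(res.tokens.map((t) => t.token));
});
```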
The issue with this is that the original housenumber (including alpha characters) is lost from the document, meaning we can't do fine-grained sorting on it later.
As a workaround we're using the `phrase.default` field to get access to those tokens.
The disadvantage of `phrase.default` is that it will contain tokens from both the street and the housenumber, potentially producing undesirable matches. For non-address queries it will also contain additional tokens.
In this issue I would like to float the idea of having a 'subfield' of `address_parts.number`, call it something like `address_parts.number.raw`, and use a different analyzer on it, such as `peliasUnit` (which doesn't strip the alpha chars).
This would remain backwards compatible while also adding an additional field address_parts.number.raw which contains both alpha and numeric tokens.
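A rough sketch of what the mapping could look like, using Elasticsearch multi-fields (this is just the shape of the idea, not actual pelias/schema code):

```js
// the parent field keeps the existing numeric analyzer, while the 'raw'
// subfield preserves alpha tokens such as '38a' or 'bis'
const number = {
  type: 'text',
  analyzer: 'peliasHousenumber',
  fields: {
    raw: {
      type: 'text',
      analyzer: 'peliasUnit' // does not strip alpha characters
    }
  }
};
```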
The benefit would be that we can then target this 'raw' field directly in our queries to do unit number sorting, etc.
The only minor disadvantage would be that the new field would increase the index size on-disk, although I expect this to be insubstantial (<~1%).
Also, if we're not going to use it then there's no sense in adding it.
cc/ @orangejulius @ianthetechie @Joxit
Yeah, this is a really good idea. I can't remember if we've discussed it in GitHub issues before, but we should even consider expanding it and having a "strict" and a "loose" subfield for most of our fields.
This could help in a lot of cases, for example:
- Scoring when diacriticals matter, such as sorting Huttenstrasse vs Hüttenstrasse
- Scoring/matching apostrophes or plurals as mentioned in https://github.com/pelias/schema/issues/434
- Housenumbers with separating characters like `/`, e.g. Via del Ponticello 38/2, Trieste, Italy (there's no issue for this yet AFAIK)
I'm sure there's more, right?
Using Elasticsearch subfields is pretty critical for this; we've known about them for a long time, and IIRC they're fairly efficient compared to adding an entire new field.
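For the diacriticals case, a sketch of how a 'loose' subfield might look; the `peliasStreetFolded` analyzer here is an invented name, standing in for an existing analyzer plus an `asciifolding`/`icu_folding` filter:

```js
// a 'strict'/'loose' pair via a subfield: the parent preserves diacriticals so
// exact spellings can be boosted, the subfield folds them so Hüttenstrasse and
// Huttenstrasse still match loosely
const street = {
  type: 'text',
  analyzer: 'peliasStreet', // existing analyzer, diacriticals preserved
  fields: {
    loose: {
      type: 'text',
      analyzer: 'peliasStreetFolded' // assumed variant with a folding filter
    }
  }
};
```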
A very belated reply on this.... yeah, I had considered if we needed something like this for a while due to the challenge of parsers handling weird "numbers" like Via del Ponticello 38/2 (currently the Pelias API is actually part of the reason this one fails some tests, as input with slashes is assumed to be an AU/NZ style unit splitter globally) and Telliskivi 60a/3. Having a separate field would probably let us do better searches. I think the queries would "just get better" if we added a should clause targeting the raw field, right?
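Something like this is what I have in mind, as a sketch only (field values illustrative):

```js
// the existing numeric match stays mandatory; the raw subfield clause is
// optional and simply ranks exact alpha matches higher
const query = {
  bool: {
    must: [
      { match: { 'address_parts.number': '60' } }
    ],
    should: [
      { match: { 'address_parts.number.raw': '60a' } }
    ]
  }
};
```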
I also like @orangejulius's idea of applying something similar to street name too. Though I'm not yet sure whether that should be "raw" or whether it should just have slightly different filtering. You guys probably have a better handle on this than I.
Finally, on the issue of Via del Ponticello 38/2, Telliskivi 60a/3, Kossuth Lajos utca 20. IV/15, and similar strange names, this would help give more accurate relevance (currently worked around with `phrase.default` as @missinglink mentioned), but actually the greater fault with these is with the parser.
Basically, there's some logic that treats everything with a slash as a unit delimiter. This actually isn't true in a surprising number of places (the house number really can have a slash, or a dash for that matter). The code also assumes the unit comes first rather than the building/house number.
https://github.com/pelias/api/blob/f7735b9117633f8bb8d30ffd3a905326b9a450ee/controller/libpostal.js#L192-L203
With this removed (or modified to take into account that only certain countries commonly use the slash this way), 2 out of the 3 queries should yield the correct result. The Hungarian one requires a bit more work 😅
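For reference, a hypothetical shape of that modification (the real logic lives in controller/libpostal.js; the country list and function name here are invented):

```js
// only treat '/' as a unit delimiter where that convention actually applies,
// and note that AU/NZ puts the unit *before* the house number
const SLASH_UNIT_COUNTRIES = new Set(['AUS', 'NZL']); // assumption, not exhaustive

function splitUnitFromHousenumber(housenumber, countryCode) {
  const parts = housenumber.split('/');
  if (parts.length === 2 && SLASH_UNIT_COUNTRIES.has(countryCode)) {
    return { unit: parts[0], number: parts[1] }; // e.g. '2/38' -> unit 2, number 38
  }
  // elsewhere the slash can be part of the housenumber itself, e.g. '38/2'
  return { number: housenumber };
}
```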
Hi there, I've been digging into all this housenumber stuff since #1697 and the support of bis/ter...
If I understand correctly, the field `address_parts.number`:
- Strips non-numeric characters to improve matching against "junk in house number data"
- Is used only on `/search`
- Cannot be used on `/autocomplete`, since we need the `peliasPhrase` analyzer from `phrase.*`, and phrase already includes the housenumber (`address_parts.street` uses the `peliasStreet` analyzer)
- Stripping non-numeric characters leads to matching everything containing the number (`3`, `3bis`, `3 bis`, ...)

The key benefit is the ability to have a fallback: if the user is looking for `3bis` and it does not exist, `3` will be returned.
Having `address_parts.number.raw` would allow us to boost exact-match results.
I was thinking: what if `address_parts.number` contained both the original value and a stripped one?
We could then manage the boosting when an alpha is present (e.g. query `3 bis` boosted 1.5 vs `3` boosted 1.25 in a `should` clause) => when the user's input is `3`, we will get both `3` and `3 bis` in arbitrary order, but when the input is `3 bis`, we will get `3 bis` then `3` 🤔
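A rough sketch of that boosting idea (boost values are just the examples above):

```js
// when the input contains an alpha token, prefer the exact raw match but keep
// the stripped numeric match as a weaker fallback
const query = {
  bool: {
    should: [
      { match: { 'address_parts.number.raw': { query: '3 bis', boost: 1.5 } } },
      { match: { 'address_parts.number': { query: '3', boost: 1.25 } } }
    ],
    minimum_should_match: 1
  }
};
```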
There are a couple of quirks worth considering when using 'aliases', and in fact when using a separate field too.
Firstly, the issue of `1/2` (ie. both housenumber and unit are numeric), which would be indexed as `['1', '2']`; this results in a query for a housenumber potentially matching on the unit number. This may in fact already be the case with the numeric filter, which is not great.
Adding additional aliases increases the 'field length', which is used in the 'norms' calculation. This is actually my biggest peeve with Elasticsearch: it will do TF/IDF on the match and then normalize by the field length, making the scores a complete mess. We've overcome this in the past using a `constant_score` query clause.
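For reference, a sketch of that workaround (boost value illustrative):

```js
// the housenumber contributes a fixed score regardless of field length,
// sidestepping the norms problem entirely
const clause = {
  constant_score: {
    filter: { match: { 'address_parts.number': '38' } },
    boost: 5 // illustrative
  }
};
```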
In the bis/ter example, we would need to ensure that the housenumber is a mandatory match before considering bis/ter, otherwise someone could just type `bis` without any number and get a fairly random list of places on the street.
I like your idea @Joxit, although we need a slightly better way of defining the 'mandatory' part of the query. In most cases it's probably the first numeric section, with the rest of the string discarded (ie. something like `.match(/\d+/)[0]`).
There are of course areas of the world where this isn't true, which complicates matters as detecting the 'mandatory' section would be required at both indexing and search time.
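A minimal sketch of that extraction, with a guard for inputs containing no digits at all (where `.match()` returns `null`):

```js
function mandatoryPart(housenumber) {
  const m = housenumber.match(/\d+/);
  return m ? m[0] : null;
}

mandatoryPart('3 bis'); // '3'
mandatoryPart('38/2');  // '38'
mandatoryPart('bis');   // null -- no housenumber clause should be generated
```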
An alternative solution would be to write a post-processing script into pelias/model which looks at the housenumber field and splits it into housenumber and unit.
As this runs after point-in-polygon, we have information about the country, which can be used to infer local naming conventions.
This would mean that the index itself would contain the 'mandatory' part in `address_parts.number` and the optional parts in `address_parts.unit`.
The challenge would then be to handle a similar operation at query-time without any contextual knowledge of the country naming conventions.
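A hypothetical sketch of such a splitter (function name and country rules are assumptions, not actual pelias/model code):

```js
// with the country known from point-in-polygon, decide whether a slash marks a
// unit or is simply part of the housenumber
function splitHousenumber(raw, countryCode) {
  const m = raw.match(/^(\S+)\/(\S+)$/);
  if (!m) { return { number: raw }; }

  // AU/NZ convention: '2/38' means unit 2, housenumber 38
  if (countryCode === 'AUS' || countryCode === 'NZL') {
    return { unit: m[1], number: m[2] };
  }

  // e.g. Italy: in 'Via del Ponticello 38/2' the slash is part of the
  // housenumber itself, so leave it intact
  return { number: raw };
}
```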
Oh yeah you're right, I like the post-processing for housenumber and unit, this could be a nice alternative.
But we would also need to implement a new display in pelias/labels for each supported country, to rebuild the correct label with its unit?
Anyway, this is a bit off topic; the original subfield idea should be interesting to work with.
> An alternative solution would be to write a post-processing script into pelias/model which looks at the housenumber field and splits it into housenumber and unit.
I think this is a promising angle. I'm already planning to work on some stuff like this anyways, as there is a lot of very badly tagged stuff in Korea.
There's a fine balance here of course between having stuff neatly schematized in the database and being able to effectively search from user input, as you point out next ;)
> The challenge would then be to handle a similar operation at query-time without any contextual knowledge of the country naming conventions.
Bingo. We currently use contextual info in the country field from libpostal parses when it's there in our API, but perfectly formed input is obviously the rarity, not the norm. We could/should probably also look at inferring this from the boundary too.
What we'd ideally be able to do is search different ways depending on where the result is located, but last I checked that's going to be a bit tricky with Elasticsearch. I wonder if an approach where we fall back to trying several different variations could work when an address looks like it might contain a unit. That way it's resilient to variations. Could probably do this with two queries, but that slows things down... idk...
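A rough sketch of that fallback approach (client usage illustrative, not actual Pelias API code):

```js
// try the strictest interpretation first, fall back to looser ones; costs up
// to one extra round trip per variation
async function searchWithFallback(client, queryVariations) {
  for (const body of queryVariations) {
    const res = await client.search({ index: 'pelias', body });
    if (res.hits.hits.length > 0) { return res; }
  }
  return null; // nothing matched under any interpretation
}
```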