add normalizer for keyword fields
cherry-picked from https://github.com/pelias/schema/pull/412 and based on https://github.com/pelias/schema/pull/414.
This PR adds a normalizer, which is the nearest thing to an analyzer for `keyword` fields.
More info here: https://github.com/elastic/elasticsearch/issues/18064
This allows us to perform some basic normalization to fields such as `layer`, `source` and `category`, forcing them to be lowercased and doing some ICU normalization.
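For context, here's a rough sketch (not the actual diff) of what a custom normalizer and its use on `keyword` fields looks like in Elasticsearch. The normalizer name and exact filter list below are only illustrative; `icu_folding` comes from the `analysis-icu` plugin, and whether it can be used inside a normalizer depends on the plugin/ES version (the built-in `asciifolding` is the closest alternative). Mapping shown in the ES 7+ format:

```
PUT /example_index
{
  "settings": {
    "analysis": {
      "normalizer": {
        "keyword_normalizer": {
          "type": "custom",
          "filter": ["lowercase", "icu_folding"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "layer":    { "type": "keyword", "normalizer": "keyword_normalizer" },
      "source":   { "type": "keyword", "normalizer": "keyword_normalizer" },
      "category": { "type": "keyword", "normalizer": "keyword_normalizer" }
    }
  }
}
```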
One notable change here is that those fields were previously case-sensitive and will now be case-insensitive, which I think is preferable despite there being a test that covered the old behaviour.
Note that not all `keyword` fields should have a normalizer specified; for instance, verbatim fields such as `bounding_box` and `addendum` are probably best left with the default `null` normalizer.
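In other words, the verbatim fields would simply omit the `normalizer` property and keep the default, roughly like this (field definitions illustrative, not the actual pelias mapping):

```
{
  "properties": {
    "bounding_box": { "type": "keyword" },
    "addendum":     { "type": "keyword" }
  }
}
```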
Normalizers are applied both at index-time and at query-time.
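To illustrate the query-time side (again against the sketch above, not the real index): a `term` query on a normalized `keyword` field has the same normalizer applied to its input, so mixed-case input still matches the lowercased indexed value:

```
GET /example_index/_search
{
  "query": {
    "term": { "layer": "Venue" }
  }
}
```

With the normalizer in place this matches documents indexed with `layer: "venue"`; without it, the `term` query would be an exact, case-sensitive match.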
I would like to add some additional filters such as `trim` and `unique`, but they are not available until version 6.4 of Elasticsearch and so will come in a subsequent PR which can be merged independently of this one.
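Purely as a hypothetical sketch of that follow-up (based only on the comment above; which filters are accepted inside normalizers varies by ES version), the normalizer definition in the `analysis.normalizer` block shown earlier would grow to something like:

```
"keyword_normalizer": {
  "type": "custom",
  "filter": ["lowercase", "icu_folding", "trim", "unique"]
}
```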
Is this normalizer necessary? The API is already doing the lowercase transformation for `layer` and `source` (`category` too?) and there are also some checks for ids :thinking:
The code looks good :smile:
Good point, I was thinking more about trying to prevent bugs by ensuring the tokens were normalized.
This normalization needs to be done both at index-time and query-time, and I could see bugs easily being introduced, especially in the `pelias/api` code.
It's not a big deal and I'd be fine with not merging this if it comes with a performance hit.
I found this old test case super confusing because it asserts that the `keyword` field is case-sensitive, even though that should never be the case 🤷‍♂️
[admission of guilt] it was written by me 😝
Ha ha ha, it's ok, 4 years ago, the statute of limitations has passed :p