schema icon indicating copy to clipboard operation
schema copied to clipboard

add normalizer for keyword fields

Open missinglink opened this issue 5 years ago • 4 comments

cherry-picked from https://github.com/pelias/schema/pull/412 and based on https://github.com/pelias/schema/pull/414.

This PR adds a normalizer which is the nearest thing to an analyzer for keyword fields. more info here: https://github.com/elastic/elasticsearch/issues/18064

This allows us to perform some basic normalization to fields such as layer, source and category, forcing them to be lowercased and doing some ICU normalization.

One notable change here is that those fields were previously case-sensitive and will now be case-insensitive, which I think is preferable despite there being a test which was covering this behaviour.

Note that not all keyword fields should have a normalizer specified, for instance, verbatim fields such as bounding_box and addendum are probably best left with the default null normalizer.

Normalizers are applied both at index-time and at query-time.

I would like to add some additional filters such as trim and unique but they are not available until version 6.4 of elasticsearch and so will come in a subsequent PR which can be merged independently of this.

missinglink avatar Dec 13 '19 15:12 missinglink

Is this normalizer necessary? The API is already doing the lowercase transformation for layer and source (category too ?) and there are also some check for ids :thinking:

The code looks good :smile:

Joxit avatar Dec 13 '19 16:12 Joxit

Good point, I was more thinking about trying to prevent bugs by ensuring the tokens were normalized. This normalization needs to be done both at index-time and query-time and I could see bugs easily being introduced, especially in the pelias/api code.

It's not a big deal and I'd be fine with not merging this if it comes with a performance hit.

missinglink avatar Dec 16 '19 10:12 missinglink

I found this old test case super confusing because it's asserting that the keyword field is case-sensitive even though it should never be the case 🤷‍♂

[admission of guilt] it was written by me 😝

missinglink avatar Dec 16 '19 10:12 missinglink

Ha ha ha, it's ok, 4 years ago, the statute of limitations has passed :p

Joxit avatar Dec 16 '19 10:12 Joxit