
Configurable preprocessing for queries

lonvia opened this issue 1 year ago • 0 comments

There have been a few cases now where it would be interesting to add some additional processing to an incoming query before it is sent to the tokenizer. It would allow adding custom filters for nonsense queries, make room for experiments with NLP pre-processing, and it would be needed for the splitting of Japanese queries as proposed in #3158.

This should work in a very similar way to the sanitizers used during import, i.e. the ICU tokenizer allows specifying a list of modules with preprocessing functions that are run in sequence over the incoming query.

Configuration

The YAML for the configuration should look about the same as for the sanitizers, with the step key naming the module to use and any further keys setting the configuration.

Example:

query-preprocessing:
  - step: clean-by-pattern
    pattern: \d+\.\d+\.\d+\.\d+
  - step: normalize
  - step: split-key-japanese-phrases

This would execute three preprocessing modules: clean_by_pattern, normalize and split_key_japanese_phrases. normalize would be the step that runs the normalization rules over the query. This is currently hard-coded in the ICU tokenizer. However, conceptually it is a simple preprocessing step, too, so we might as well make it explicit. It also means that the user has the choice whether a preprocessing step runs on the original input or on the normalized form. This might already be relevant for Japanese key splitting: normalization includes rules that convert simplified to traditional Chinese characters. This loses valuable information, because simplified Chinese characters are a clear sign that the input is not Japanese.
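The sequencing described above can be sketched as a chain of callables applied one after the other. This is a simplified illustration only: the functions here operate on a bare string, the step names mirror the example configuration, and the normalize stand-in is not the real ICU normalization.

```python
import re
from typing import Callable, List

def clean_by_pattern(pattern: str) -> Callable[[str], str]:
    # Hypothetical step: remove everything matching the configured pattern.
    compiled = re.compile(pattern)
    return lambda query: compiled.sub('', query)

def normalize(query: str) -> str:
    # Stand-in for the ICU normalization rules: lowercase and
    # collapse whitespace.
    return ' '.join(query.lower().split())

# The configured steps, run in sequence over the incoming query.
steps: List[Callable[[str], str]] = [
    clean_by_pattern(r'\d+\.\d+\.\d+\.\d+'),  # drop IP-like tokens
    normalize,
]

query = 'Berlin 127.0.0.1 Hauptbahnhof'
for step in steps:
    query = step(query)
print(query)  # -> 'berlin hauptbahnhof'
```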

Preprocessing modules

The preprocessing modules should go into nominatim/tokenizer/query_preprocessing. Most of this should work exactly like the sanitizers; see base.py.

Each module needs to export a create function that creates the preprocessor:

def create(config: QueryConfig) -> Callable[[QueryInfo], None]: pass

QueryConfig can be an alias to dict for the moment. We might want to add additional convenience functions as in SanitizerConfig later.

QueryInfo should have a single field: a List[Phrase]. This list should be mutable by the preprocessor function. The indirection via a QueryInfo class allows us to add more functionality to the preprocessing later without breaking existing code.
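Taken together, a preprocessor module could look like the following minimal sketch. QueryConfig, QueryInfo and Phrase are modelled directly on the description above; the Phrase field name and the clean-by-pattern behaviour are assumptions for illustration.

```python
import re
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# Alias to dict for the moment, as suggested above.
QueryConfig = Dict[str, Any]

@dataclass
class Phrase:
    text: str  # assumed field name

@dataclass
class QueryInfo:
    phrases: List[Phrase]  # the only field for now; mutated in place

# Example module: a hypothetical 'clean-by-pattern' preprocessor.
def create(config: QueryConfig) -> Callable[[QueryInfo], None]:
    pattern = re.compile(config['pattern'])

    def _process(query: QueryInfo) -> None:
        # Mutate the phrases in place instead of returning them.
        for phrase in query.phrases:
            phrase.text = pattern.sub('', phrase.text).strip()

    return _process

prep = create({'pattern': r'\d+\.\d+\.\d+\.\d+'})
info = QueryInfo(phrases=[Phrase('10.0.0.1 Main Street')])
prep(info)
print(info.phrases[0].text)  # -> 'Main Street'
```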

Loading the preprocessors

It is important that the preprocessor chain is loaded only once and then cached. The setup function is the right place to do that: self.conn.get_cached_value makes sure that a setup function like _make_transliterator is executed only once. The equivalent code for setting up the sanitizer chain is at https://github.com/osm-search/Nominatim/blob/master/nominatim/tokenizer/place_sanitizer.py#L28

The tricky part is getting the information from the YAML configuration. This needs access to the Configuration object, which is not available here. We should add it as a property to the SearchConnection class; it can easily be set from self.config when the connection is created. Once this is done, something along the lines of self.conn.config.load_sub_configuration('icu_tokenizer.yaml', config='TOKENIZER_CONFIG')['query-preprocessing'] should do the trick.
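The once-only setup can be sketched as follows, assuming a get_cached_value-style helper that runs a factory the first time a key is requested and returns the cached result afterwards. The cache class, key names and the factory body here are illustrative, not Nominatim's actual API.

```python
from typing import Any, Callable, Dict

class ConnCache:
    # Stand-in for the value cache on SearchConnection.
    def __init__(self) -> None:
        self._cache: Dict[str, Any] = {}

    def get_cached_value(self, key: str,
                         factory: Callable[[], Any]) -> Any:
        # The factory runs only on the first request for a key.
        if key not in self._cache:
            self._cache[key] = factory()
        return self._cache[key]

def _make_preprocessors() -> list:
    # In the real code this would read the 'query-preprocessing'
    # section from the tokenizer configuration and import one
    # module per 'step' entry; here we just fake the result.
    steps = [{'step': 'normalize'}]
    return [s['step'] for s in steps]

conn = ConnCache()
first = conn.get_cached_value('query_preprocessors', _make_preprocessors)
second = conn.get_cached_value('query_preprocessors', _make_preprocessors)
print(first is second)  # -> True: the chain is built only once
```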

Using the preprocessors

This is mostly done in PR #3158 already. The only differences are that the list of functions is no longer hardcoded and that the phrases are mutated inside a QueryInfo object instead of being returned from the function.
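The in-place usage could look like this small sketch, where each preprocessor mutates the QueryInfo and returns nothing. The Phrase and QueryInfo shapes and the example step are assumptions carried over from the description above.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Phrase:
    text: str

@dataclass
class QueryInfo:
    phrases: List[Phrase]

def collapse_spaces(info: QueryInfo) -> None:
    # Hypothetical preprocessor: tidy up whitespace in each phrase.
    for phrase in info.phrases:
        phrase.text = ' '.join(phrase.text.split())

# The chain would normally come from the cached setup, not be
# hardcoded like this.
preprocessors: List[Callable[[QueryInfo], None]] = [collapse_spaces]

info = QueryInfo(phrases=[Phrase('  Main   Street ')])
for func in preprocessors:
    func(info)  # mutates info.phrases in place, nothing is returned
print(info.phrases[0].text)  # -> 'Main Street'
```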

lonvia avatar Sep 07 '23 15:09 lonvia