Add support for stop words with a separate stop word table

Open kou opened this issue 2 years ago • 0 comments

What is your problem?

TokenFilterStopWord adds a column to a lexicon to indicate whether the term is a stop word or not. We can't use TokenFilterStopWord with PGroonga because users can't add a custom column to a lexicon in PGroonga.

If we add support for a separate stop word table, PGroonga users can use TokenFilterStopWord. For example:

plugin_register token_filters/stop_word

table_create StopWords TABLE_PAT_KEY ShortText \
  --normalizers NormalizerNFKC150
column_create StopWords is_stop_word COLUMN_SCALAR Bool
load --table StopWords
[
{"_key": "and", "is_stop_word": true}
]

table_create Lexicon TABLE_PAT_KEY ShortText \
  --default_tokenizer TokenMecab \
  --normalizers NormalizerNFKC150 \
  --token_filters 'TokenFilterStopWord("column", "StopWords.is_stop_word")'
  • The type of StopWords._key must be the same as the type of Lexicon._key
  • StopWords.is_stop_word must be of Bool type
  • It may be better not to reuse the "column" option of TokenFilterStopWord for this
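
The lookup this proposal implies can be modeled in a few lines of Python. This is only a sketch of the intended semantics, not Groonga's implementation: `StopWordTable` and `filter_stop_words` are hypothetical names, lowercasing stands in for the table's normalizer, and the sketch ignores when Groonga actually applies the filter. It checks the two constraints listed above (matching key types, Bool flag) and then drops tokens flagged in the separate table.

```python
from dataclasses import dataclass, field


@dataclass
class StopWordTable:
    """Hypothetical stand-in for the separate StopWords table."""

    key_type: str = "ShortText"
    flags: dict = field(default_factory=dict)  # normalized key -> Bool flag

    def load(self, key: str, is_stop_word: bool) -> None:
        # Constraint: StopWords.is_stop_word must be of Bool type.
        if not isinstance(is_stop_word, bool):
            raise TypeError("is_stop_word must be a Bool")
        # Assumption: lowercasing stands in for the table's normalizer.
        self.flags[key.lower()] = is_stop_word


def filter_stop_words(tokens, lexicon_key_type, stop_words):
    """Drop tokens that the separate table flags as stop words."""
    # Constraint: StopWords._key and Lexicon._key must have the same type.
    if stop_words.key_type != lexicon_key_type:
        raise TypeError("StopWords._key and Lexicon._key types differ")
    return [t for t in tokens if not stop_words.flags.get(t.lower(), False)]


table = StopWordTable()
table.load("and", True)
print(filter_stop_words(["hello", "and", "good-bye"], "ShortText", table))
# → ['hello', 'good-bye']
```

Because the flag lives in its own table, the lexicon needs no extra column, which is the point of the request for PGroonga.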

How to reproduce it

No response

kou, Feb 20 '23 00:02