pinot Allow users to specify lucene analyzer when creating text index

Currently, text indexes always use Lucene's StandardAnalyzer. This analyzer strips special characters, so searching for them does not work. It would be useful to let users specify a different analyzer depending on their use case.

Aug 03 '22 14:08 TT1103

We should allow choosing between StandardAnalyzer and WhitespaceAnalyzer when creating a text field.

Aug 03 '22 14:08 atris

We should think about how best to configure the analysis chain using any available analyzers/filters/etc. E.g. we have fields with Japanese, and want to use the Kuromoji analyzer. See my comment on the "Support for Native Text Indexing in Pinot" issue.

Aug 08 '22 20:08 kkrugler

We have a requirement to use an analyzer other than the standard analyzer.

@kkrugler, I see two aspects to this problem: a static analyzer per field, and a dynamic analyzer chosen per field value. The former is easier to solve, using an analyzer class name (Lucene or custom-provided). The latter can be solved separately and may need a little more design:

Detector: detect which analyzer to use for the given text
AnalyzerFactory: provide an analyzer instance based on detector's result.
Configuration: Tricky, as it is hard to generalize. The idea is to configure the detector that identifies the analyzer for the field's value.

It can be a quick win if we can start with solving the former case while working on the latter's design. Will be interested in helping here.
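The Detector idea above could be sketched roughly as follows. This is only an illustration, not a proposed design: the class name AnalyzerDetector and the kana-based rule are assumptions, and the detector returns an analyzer class name that a factory would then instantiate. Only the JDK is used, so it runs without Lucene on the classpath.

```java
import java.lang.Character.UnicodeScript;

/** Hypothetical sketch: choose an analyzer class name from the scripts found in a value. */
final class AnalyzerDetector {
    static final String STANDARD = "org.apache.lucene.analysis.standard.StandardAnalyzer";
    static final String JAPANESE = "org.apache.lucene.analysis.ja.JapaneseAnalyzer";

    /** Detector: returns the class name of the analyzer to use for the given text. */
    static String detect(String text) {
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            UnicodeScript script = UnicodeScript.of(codePoint);
            // Kana only occurs in Japanese text; Han alone is ambiguous, so it is ignored here.
            if (script == UnicodeScript.HIRAGANA || script == UnicodeScript.KATAKANA) {
                return JAPANESE;
            }
            i += Character.charCount(codePoint);
        }
        return STANDARD; // fallback when no kana is present
    }
}
```

A real detector would likely need language identification rather than a script check, which is part of why the configuration is hard to generalize.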

Oct 03 '23 19:10 rohityadav1993

+1 - let's start with the static approach. Can you come up with a design document?

Oct 03 '23 19:10 atris

@atris, I can come up with a two-part ERD. Could you assign the task to me?

Oct 04 '23 09:10 rohityadav1993

@rohityadav1993 We have implemented this requested feature because we need it right away. After testing on production clusters at a large scale, we can release the changes here.

For now, on a per-text-index basis, we can specify the FQCN (fully qualified class name) of the Lucene analyzer to be used for both ingestion and query. For example, for field XXX, we can specify org.apache.lucene.analysis.core.WhitespaceAnalyzer, org.apache.lucene.analysis.core.KeywordAnalyzer, or any other analyzer of your choice.
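For readers unfamiliar with loading a class by its FQCN, a minimal sketch of the general technique follows. It is not the code used in Pinot; ReflectiveLoader and newInstance are hypothetical names, and java.util.ArrayList stands in for an analyzer so the example runs on the JDK alone.

```java
/** Hypothetical helper, not Pinot's actual code: create an instance from a class name. */
final class ReflectiveLoader {
    /** Loads fqcn, checks it is a subtype of expectedType, and calls its no-arg constructor. */
    static <T> T newInstance(String fqcn, Class<T> expectedType) {
        try {
            Class<? extends T> clazz = Class.forName(fqcn).asSubclass(expectedType);
            return clazz.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            // Covers unknown classes, wrong types, and missing or inaccessible constructors.
            throw new IllegalArgumentException(
                "Cannot instantiate " + fqcn + " as " + expectedType.getName(), e);
        }
    }
}
```

With Lucene on the classpath, the equivalent call would pass org.apache.lucene.analysis.Analyzer.class as the expected type, so a misconfigured class name fails at load time rather than at query time.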

Nov 02 '23 00:11 jackluo923

@rohityadav1993 @kkrugler @TT1103 Could you test out the latest change on the master branch? I haven't had time to update the docs, but will do it soon. For now, just change the following to use another analyzer:

"fieldConfigList": [
  {
    "name": "columnName",
    "indexType": "TEXT",
    "indexTypes": [
      "TEXT"
    ],
    "properties": {
      "luceneAnalyzerClass": "org.apache.lucene.analysis.core.KeywordAnalyzer"
    }
  }
]

@kkrugler For the Kuromoji analyzer, are there any parameters you want to pass to it? If so, let me know which ones, or you can create a PR yourself. Then we can update the docs all at once.

Dec 12 '23 03:12 jackluo923

@TT1103 does this merged feature solve this issue for you? If so, we could close this issue... as soon as @jackluo923 has updated the docs...

Jun 15 '24 21:06 hpvd

We have an additional feature that we want to commit to OSS. It allows users to pass parameters to a custom analyzer and to pair it with a custom query parser if needed.

"fieldConfigList": [
  {
    "name": "columnName",
    "indexType": "TEXT",
    "indexTypes": [
      "TEXT"
    ],
    "properties": {
      "luceneAnalyzerClass": "x.utils.lucene.analyzer.DelimiterAnalyzer",
      "luceneAnalyzerClassArgTypes": "java.lang.String, java.lang.String",
      "luceneAnalyzerClassArgs": " \\,.\n\t()[]{}\"':=-_$\\?@&|#+/,\\,.()[]{}\"':=-_$\\?@&|#+/",
      "luceneQueryParserClass": "x.utils.lucene.queryparser.AnalyzingQueryParser"
    }
  }
]

We've been using these configs in production for the better part of a year now and they seem to work well. The PR is here, but I need to add more unit tests before it can be merged into OSS. We'll update the docs very soon afterwards.
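To illustrate how a comma-separated argument string with escaped commas might be turned into constructor arguments, here is a rough sketch. The convention that "\," means a literal comma is inferred from the example config and may not match the PR; AnalyzerArgs, splitArgs, and construct are hypothetical names; only String-typed parameters are handled, and java.io.File stands in for the analyzer class.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch, not the PR's code: parse analyzer args and call a matching constructor. */
final class AnalyzerArgs {
    /** Splits on commas, treating a backslash followed by a comma as a literal comma (assumed). */
    static List<String> splitArgs(String raw) {
        List<String> args = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == '\\' && i + 1 < raw.length() && raw.charAt(i + 1) == ',') {
                current.append(',');
                i++; // skip the escaped comma
            } else if (c == ',') {
                args.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c); // other backslashes are kept unchanged
            }
        }
        args.add(current.toString());
        return args;
    }

    /** Calls the public constructor of className whose parameter types match argTypes. */
    static Object construct(String className, String argTypes, String rawArgs) {
        String[] typeNames = argTypes.split(",");
        List<String> args = splitArgs(rawArgs);
        if (args.size() != typeNames.length) {
            throw new IllegalArgumentException("Expected " + typeNames.length + " args, got " + args.size());
        }
        try {
            Class<?>[] types = new Class<?>[typeNames.length];
            for (int i = 0; i < typeNames.length; i++) {
                types[i] = Class.forName(typeNames[i].trim());
            }
            // Arguments are passed as Strings, so this sketch only supports String parameters.
            return Class.forName(className).getConstructor(types).newInstance(args.toArray());
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot construct " + className, e);
        }
    }
}
```

Under this assumed convention, the luceneAnalyzerClassArgs value in the config above splits into two strings, matching the two java.lang.String entries in luceneAnalyzerClassArgTypes.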

Jun 26 '24 17:06 jackluo923
