pinot
Allow users to specify lucene analyzer when creating text index
Currently, the StandardAnalyzer is used. This analyzer removes special characters, so if users wanted to search for them, it would not work. It would be useful to allow the user to specify other analyzers depending on their use case.
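To make the problem concrete, here is a minimal Java sketch that contrasts the two tokenization styles using only the standard library. It does not call Lucene. Both `whitespaceTokens` and `crudeStandardTokens` are hypothetical helpers, and the second is a deliberately crude stand-in for StandardAnalyzer (lowercase, then split on anything that is not a letter or digit). The real analyzer follows Unicode word-break rules and will differ on some inputs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical illustration only; this is not Lucene code. */
final class TokenizationContrastSketch {

  /** Splits on runs of whitespace and keeps every other character, punctuation included. */
  static List<String> whitespaceTokens(String text) {
    String trimmed = text.trim();
    if (trimmed.isEmpty()) {
      return new ArrayList<String>();
    }
    return Arrays.asList(trimmed.split("\\s+"));
  }

  /** Crude approximation (an assumption): lowercase, then split on non-letter, non-digit runs. */
  static List<String> crudeStandardTokens(String text) {
    List<String> tokens = new ArrayList<String>();
    for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
      if (!token.isEmpty()) {
        tokens.add(token);
      }
    }
    return tokens;
  }

  public static void main(String[] args) {
    String text = "error-code: C++ @home";
    System.out.println(whitespaceTokens(text));    // prints [error-code:, C++, @home]
    System.out.println(crudeStandardTokens(text)); // prints [error, code, c, home]
  }
}
```

With the second style, a search for `C++` or `@home` has nothing to match against, because the special characters never reach the index.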
We should allow choosing between StandardAnalyzer and WhitespaceAnalyzer when creating a text field.
We should think about how best to configure the analysis chain using any available analyzers/filters/etc. E.g. we have fields with Japanese, and want to use the Kuromoji analyzer. See my comment on the "Support for Native Text Indexing in Pinot" issue.
We have a requirement to use an analyzer other than the StandardAnalyzer.
@kkrugler, I see this as a problem with two aspects: a static analyzer and a dynamic analyzer per field. The former is easier to solve, using an analyzer class name (Lucene-provided or custom). The latter can be solved separately and may involve a little more design:
- Detector: detect which analyzer to use for the given text
- AnalyzerFactory: provide an analyzer instance based on detector's result.
- Configuration: tricky, as it is hard to generalize. The idea is to define how to set up the detector that identifies the analyzer for the field's value.
It would be a quick win to start by solving the former case while working on the design for the latter. I would be interested in helping here.
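The detector and factory split described above could look roughly like the following Java sketch. Everything here is hypothetical: `DynamicAnalyzerSketch`, `detect`, and `analyzerClassFor` are illustrative names, and the script-based detection rule is only one possible choice (Han characters also occur in Chinese text, for instance). The factory returns a class name rather than an analyzer instance so that the sketch runs without Lucene on the classpath.

```java
/** Hypothetical sketch of the detector + factory split; none of these names exist in Pinot. */
final class DynamicAnalyzerSketch {

  /** Detector: classifies a field value. Assumed rule: any Hiragana, Katakana, or Han code point means "ja". */
  static String detect(String text) {
    boolean hasJapaneseScript = text.codePoints().anyMatch(cp -> {
      Character.UnicodeScript script = Character.UnicodeScript.of(cp);
      return script == Character.UnicodeScript.HIRAGANA
          || script == Character.UnicodeScript.KATAKANA
          || script == Character.UnicodeScript.HAN;
    });
    return hasJapaneseScript ? "ja" : "default";
  }

  /** AnalyzerFactory: maps the detector's result to the class name of the analyzer to use. */
  static String analyzerClassFor(String detected) {
    switch (detected) {
      case "ja":
        // Lucene's Kuromoji-based analyzer.
        return "org.apache.lucene.analysis.ja.JapaneseAnalyzer";
      default:
        return "org.apache.lucene.analysis.standard.StandardAnalyzer";
    }
  }

  public static void main(String[] args) {
    // "\u6771\u4eac" is the word Tokyo written in Han characters.
    for (String value : new String[] {"hello world", "\u6771\u4eac"}) {
      System.out.println(analyzerClassFor(detect(value)));
    }
  }
}
```

The configuration question from the list remains open in this sketch: the mapping is hard-coded, whereas a real design would need some way to declare it per field.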
+1 - let's start with the static approach. Can you come up with a design document?
@atris, I can come up with a two-part ERD. Could you assign the task to me?
@rohityadav1993 We have implemented this requested feature because we need it right away. After testing on production clusters at a large scale, we can release the changes here.
For now, on a per-text-index basis, we can specify the FQCN (fully qualified class name) of the Lucene analyzer to be used for both ingestion and query. For example, for field XXX, we can specify org.apache.lucene.analysis.core.WhitespaceAnalyzer, org.apache.lucene.analysis.core.KeywordAnalyzer, or any analyzer of your choice.
@rohityadav1993 @kkrugler @TT1103 Could you test out the latest change on the master branch? I haven't had time to update the docs, but will do it soon. For now, just change the following to use another analyzer:
"fieldConfigList": [
  {
    "name": "columnName",
    "indexType": "TEXT",
    "indexTypes": [
      "TEXT"
    ],
    "properties": {
      "luceneAnalyzerClass": "org.apache.lucene.analysis.core.KeywordAnalyzer"
    }
  }
]
@kkrugler For the Kuromoji analyzer, are there any parameters you want to pass to the analyzer? If so, let me know which ones, or you can create a PR yourself. Then we can update the docs all at once.
@TT1103 does this merged feature solve this issue for you? If so, we could close this issue as soon as @jackluo923 has updated the docs.
We have an additional feature that we want to contribute to OSS: it allows users to pass parameters to a custom analyzer and, if needed, pair it with a custom query parser.
"fieldConfigList": [
  {
    "name": "columnName",
    "indexType": "TEXT",
    "indexTypes": [
      "TEXT"
    ],
    "properties": {
      "luceneAnalyzerClass": "x.utils.lucene.analyzer.DelimiterAnalyzer",
      "luceneAnalyzerClassArgTypes": "java.lang.String, java.lang.String",
      "luceneAnalyzerClassArgs": " \\,.\n\t()[]{}\"':=-_$\\?@&|#+/,\\,.()[]{}\"':=-_$\\?@&|#+/",
      "luceneQueryParserClass": "x.utils.lucene.queryparser.AnalyzingQueryParser"
    }
  }
]
We've been using these configs in production for the better part of a year now, and they seem to work well. The PR is here, but I need to add more unit tests before it can be merged into OSS. We'll update the docs very soon afterwards.
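For readers wondering how a class name, a list of argument types, and a list of arguments might be turned into an object, here is a minimal Java sketch using reflection. It is not the code from the PR. `ReflectiveFactorySketch`, `splitUnescaped`, and `instantiate` are hypothetical names, and the parsing convention (values separated by commas, with `\,` standing for a literal comma) is an assumption inferred from the example value above, where it would yield exactly two arguments for the two declared String types. The sketch supports only `java.lang.String` parameters and uses `java.lang.StringBuilder` as a stand-in target so that it runs without Lucene.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; a sketch under stated assumptions, not Pinot's implementation. */
final class ReflectiveFactorySketch {

  /** Splits on commas; the sequence backslash-comma yields a literal comma (assumed convention). */
  static List<String> splitUnescaped(String raw) {
    List<String> parts = new ArrayList<String>();
    StringBuilder current = new StringBuilder();
    for (int i = 0; i < raw.length(); i++) {
      char c = raw.charAt(i);
      if (c == '\\' && i + 1 < raw.length() && raw.charAt(i + 1) == ',') {
        current.append(',');
        i++; // consume the escaped comma
      } else if (c == ',') {
        parts.add(current.toString());
        current.setLength(0);
      } else {
        current.append(c);
      }
    }
    parts.add(current.toString());
    return parts;
  }

  /** Builds className through the public constructor matching argTypes; String parameters only. */
  static Object instantiate(String className, String argTypes, String args) {
    List<String> typeNames =
        argTypes.trim().isEmpty() ? new ArrayList<String>() : splitUnescaped(argTypes);
    List<String> values = typeNames.isEmpty() ? new ArrayList<String>() : splitUnescaped(args);
    if (typeNames.size() != values.size()) {
      throw new IllegalArgumentException("Argument count does not match argument type count");
    }
    Class<?>[] types = new Class<?>[typeNames.size()];
    Object[] ctorArgs = new Object[values.size()];
    for (int i = 0; i < types.length; i++) {
      // Type names are trimmed; values are not, since leading whitespace may be significant.
      if (!"java.lang.String".equals(typeNames.get(i).trim())) {
        throw new IllegalArgumentException("Unsupported type in this sketch: " + typeNames.get(i));
      }
      types[i] = String.class;
      ctorArgs[i] = values.get(i);
    }
    try {
      return Class.forName(className).getConstructor(types).newInstance(ctorArgs);
    } catch (ReflectiveOperationException e) {
      throw new IllegalArgumentException("Cannot instantiate " + className, e);
    }
  }

  public static void main(String[] args) {
    // Stand-in target class; a real config would name an analyzer class instead.
    Object built = instantiate("java.lang.StringBuilder", "java.lang.String", "a\\,b");
    System.out.println(built); // prints a,b
  }
}
```

Leaving both the type list and the argument list empty falls back to the no-argument constructor, which corresponds to the simpler configuration that sets only `luceneAnalyzerClass`.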