pinot
pinot copied to clipboard
[SchemaConformingTransformer]Add one more json_data field
feature ingestion
To ingest json-like data through SchemaConformingTransformer, previously, we could only dump data to either dedicated column or a single json_data field. We may build json index on json_data field to serve queries.
However, in reality, users may e.g. dump data like request/response body to the json data and query them. For those special json field, they may want special treatment like more levels of json index.
With this patch, among the non-dedicated column data, we will be able to dump them to either default field or a special field filtered by its prefix. And in real use cases, users could choose to build more index or add other special treatments to it.
One example, user might set 3 levels json on the default json_data field but needs some part of json (e.g. request and response) has deeper level index, e.g. 5.
Then they could set indexableSpecialFieldPrefix: ["request", "response"] in the SchemaConformingTransformerV2 config, and add json_data_inf_idx as json index in the config with 5 levels.
"schemaConformingTransformerV2Config": {
"enableIndexableExtras": true,
"indexableExtrasField": "json_data",
"enableUnindexableExtras": true,
"indexableSpecialFieldPrefix": ["request", "response"],
"indexableSpecialField": "json_data_inf_idx",
"unindexableExtrasField": "json_data_no_idx",
"unindexableFieldSuffix": "_noindex",
"fieldPathsToDrop": [],
"fieldPathsToSkipStorage": [
"message"
],
"columnNameToJsonKeyPathMap": {},
"mergedTextIndexFields": [
"__mergedTextIndex"
],
"optimizeCaseInsensitiveSearch": false,
"reverseTextIndexKeyValueOrder": true,
"mergedTextIndexDocumentMaxLength": 32766,
"mergedTextIndexBinaryDocumentDetectionMinLength": 512,
"mergedTextIndexPathToExclude": [
"_timestampMillisNegative",
"__mergedTextIndex",
"_timestampMillis"
],
"fieldsToDoubleIngest": [],
"jsonKeyValueSeparator": "\u001e",
"mergedTextIndexBeginOfDocAnchor": "\u0002",
"mergedTextIndexEndOfDocAnchor": "\u0003",
"fieldPathsToPreserveInput": [],
"fieldPathsToPreserveInputWithIndex": []
},
"jsonIndexConfigs": {
"json_data": {
"disabled": false,
"maxLevels": 3,
"excludeArray": true,
"disableCrossArrayUnnest": true,
"maxValueLength": 1000,
"skipInvalidJson": true
},
"json_data_inf_idx": {
"disabled": false,
"maxLevels": 5,
"excludeArray": true,
"disableCrossArrayUnnest": true,
"maxValueLength": 1000,
"skipInvalidJson": true
}
},
Codecov Report
Attention: Patch coverage is 70.45455% with 13 lines in your changes missing coverage. Please review.
Project coverage is 63.98%. Comparing base (
59551e4) to head (ef3cd01). Report is 1094 commits behind head on master.
Additional details and impacted files
@@ Coverage Diff @@
## master #14097 +/- ##
============================================
+ Coverage 61.75% 63.98% +2.23%
- Complexity 207 1535 +1328
============================================
Files 2436 2596 +160
Lines 133233 143330 +10097
Branches 20636 21956 +1320
============================================
+ Hits 82274 91714 +9440
+ Misses 44911 44866 -45
- Partials 6048 6750 +702
| Flag | Coverage Δ | |
|---|---|---|
| custom-integration1 | 100.00% <ø> (+99.99%) |
:arrow_up: |
| integration | 100.00% <ø> (+99.99%) |
:arrow_up: |
| integration1 | 100.00% <ø> (+99.99%) |
:arrow_up: |
| integration2 | 0.00% <ø> (ø) |
|
| java-11 | 63.96% <70.45%> (+2.25%) |
:arrow_up: |
| java-21 | 63.87% <70.45%> (+2.24%) |
:arrow_up: |
| skip-bytebuffers-false | 63.98% <70.45%> (+2.23%) |
:arrow_up: |
| skip-bytebuffers-true | 63.83% <70.45%> (+36.11%) |
:arrow_up: |
| temurin | 63.98% <70.45%> (+2.23%) |
:arrow_up: |
| unittests | 63.98% <70.45%> (+2.23%) |
:arrow_up: |
| unittests1 | 55.55% <0.00%> (+8.65%) |
:arrow_up: |
| unittests2 | 34.48% <70.45%> (+6.75%) |
:arrow_up: |
Flags with carried forward coverage won't be shown. Click here to find out more.
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
However, in reality, users may e.g. dump data like request/response body to the json data and query them. For those special json field, they may want special treatment like more levels of json index.
Can you add one example with data to illustrate the use case?
With this patch, among the non-dedicated column data, we will be able to dump them to either default field or a special field filtered by its prefix. And in real use cases, users could choose to build more index or add other special treatments to it.
One example can be added to explain the new feature added in the PR summary. Are you also proposing another new config to the transfer? If so, please list it in the summary too.
However, in reality, users may e.g. dump data like request/response body to the json data and query them. For those special json field, they may want special treatment like more levels of json index.
Can you add one example with data to illustrate the use case?
With this patch, among the non-dedicated column data, we will be able to dump them to either default field or a special field filtered by its prefix. And in real use cases, users could choose to build more index or add other special treatments to it.
One example can be added to explain the new feature added in the PR summary. Are you also proposing another new config to the transfer? If so, please list it in the summary too.
Thanks for the review, updated the summary with an example.
Why can't we use the includePaths config in Json index configuration option to explicitly index a certain path with any depth?
https://docs.pinot.apache.org/basics/indexing/json-index#enable-and-configure-a-json-index
Why can't we use the includePaths config in Json index configuration option to explicitly index a certain path with any depth?
https://docs.pinot.apache.org/basics/indexing/json-index#enable-and-configure-a-json-index
IIUC, the includePaths config is specified in the json index config of a field like
"json_data": {
"disabled": false,
"maxLevels": 3,
"excludeArray": true,
"disableCrossArrayUnnest": true,
"maxValueLength": 1000,
"includePaths": ["request", "response"],
"skipInvalidJson": true
},
But the behavior will be only ["request", "response"] in json_data field could have json index, but other fields cannot have the json index. This is not what users might want as ["request", "response"] with 5 levels index but rest with 3 levels index.
Abandon the change, will use https://github.com/apache/pinot/pull/14200 instead