pinot [SchemaConformingTransformer]Add one more json

feature ingestion To ingest json-like data through SchemaConformingTransformer, previously, we could only dump data to either dedicated column or a single json_data field. We may build json index on json_data field to serve queries. However, in reality, users may e.g. dump data like request/response body to the json data and query them. For those special json field, they may want special treatment like more levels of json index. With this patch, among the non-dedicated column data, we will be able to dump them to either default field or a special field filtered by its prefix. And in real use cases, users could choose to build more index or add other special treatments to it.

One example, user might set 3 levels json on the default json_data field but needs some part of json (e.g. request and response) has deeper level index, e.g. 5. Then they could set indexableSpecialFieldPrefix: ["request", "response"] in the SchemaConformingTransformerV2 config, and add json_data_inf_idx as json index in the config with 5 levels.

"schemaConformingTransformerV2Config": {
        "enableIndexableExtras": true,
        "indexableExtrasField": "json_data",
        "enableUnindexableExtras": true,
        "indexableSpecialFieldPrefix": ["request", "response"],
        "indexableSpecialField": "json_data_inf_idx",
        "unindexableExtrasField": "json_data_no_idx",
        "unindexableFieldSuffix": "_noindex",
        "fieldPathsToDrop": [],
        "fieldPathsToSkipStorage": [
          "message"
        ],
        "columnNameToJsonKeyPathMap": {},
        "mergedTextIndexFields": [
          "__mergedTextIndex"
        ],
        "optimizeCaseInsensitiveSearch": false,
        "reverseTextIndexKeyValueOrder": true,
        "mergedTextIndexDocumentMaxLength": 32766,
        "mergedTextIndexBinaryDocumentDetectionMinLength": 512,
        "mergedTextIndexPathToExclude": [
          "_timestampMillisNegative",
          "__mergedTextIndex",
          "_timestampMillis"
        ],
        "fieldsToDoubleIngest": [],
        "jsonKeyValueSeparator": "\u001e",
        "mergedTextIndexBeginOfDocAnchor": "\u0002",
        "mergedTextIndexEndOfDocAnchor": "\u0003",
        "fieldPathsToPreserveInput": [],
        "fieldPathsToPreserveInputWithIndex": []
      },
"jsonIndexConfigs": {
        "json_data": {
          "disabled": false,
          "maxLevels": 3,
          "excludeArray": true,
          "disableCrossArrayUnnest": true,
          "maxValueLength": 1000,
          "skipInvalidJson": true
        },
      "json_data_inf_idx": {
          "disabled": false,
          "maxLevels": 5,
          "excludeArray": true,
          "disableCrossArrayUnnest": true,
          "maxValueLength": 1000,
          "skipInvalidJson": true
        }
      },

Sep 26 '24 22:09 lnbest0707-uber

Codecov Report

Attention: Patch coverage is 70.45455% with 13 lines in your changes missing coverage. Please review.

Project coverage is 63.98%. Comparing base (59551e4) to head (ef3cd01). Report is 1094 commits behind head on master.

Files with missing lines	Patch %	Lines
...ingestion/SchemaConformingTransformerV2Config.java	0.00%	11 Missing :warning:
...recordtransformer/SchemaConformingTransformer.java	96.29%	0 Missing and 1 partial :warning:
...cordtransformer/SchemaConformingTransformerV2.java	83.33%	1 Missing :warning:

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #14097      +/-   ##
============================================
+ Coverage     61.75%   63.98%   +2.23%     
- Complexity      207     1535    +1328     
============================================
  Files          2436     2596     +160     
  Lines        133233   143330   +10097     
  Branches      20636    21956    +1320     
============================================
+ Hits          82274    91714    +9440     
+ Misses        44911    44866      -45     
- Partials       6048     6750     +702

Flag	Coverage Δ
custom-integration1	`100.00% <ø> (+99.99%)`	:arrow_up:
integration	`100.00% <ø> (+99.99%)`	:arrow_up:
integration1	`100.00% <ø> (+99.99%)`	:arrow_up:
integration2	`0.00% <ø> (ø)`
java-11	`63.96% <70.45%> (+2.25%)`	:arrow_up:
java-21	`63.87% <70.45%> (+2.24%)`	:arrow_up:
skip-bytebuffers-false	`63.98% <70.45%> (+2.23%)`	:arrow_up:
skip-bytebuffers-true	`63.83% <70.45%> (+36.11%)`	:arrow_up:
temurin	`63.98% <70.45%> (+2.23%)`	:arrow_up:
unittests	`63.98% <70.45%> (+2.23%)`	:arrow_up:
unittests1	`55.55% <0.00%> (+8.65%)`	:arrow_up:
unittests2	`34.48% <70.45%> (+6.75%)`	:arrow_up:

Flags with carried forward coverage won't be shown. Click here to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

Sep 26 '24 22:09 codecov-commenter

However, in reality, users may e.g. dump data like request/response body to the json data and query them. For those special json field, they may want special treatment like more levels of json index.

Can you add one example with data to illustrate the use case?

With this patch, among the non-dedicated column data, we will be able to dump them to either default field or a special field filtered by its prefix. And in real use cases, users could choose to build more index or add other special treatments to it.

One example can be added to explain the new feature added in the PR summary. Are you also proposing another new config to the transfer? If so, please list it in the summary too.

Oct 01 '24 03:10 chenboat

However, in reality, users may e.g. dump data like request/response body to the json data and query them. For those special json field, they may want special treatment like more levels of json index.

Can you add one example with data to illustrate the use case?

With this patch, among the non-dedicated column data, we will be able to dump them to either default field or a special field filtered by its prefix. And in real use cases, users could choose to build more index or add other special treatments to it.

One example can be added to explain the new feature added in the PR summary. Are you also proposing another new config to the transfer? If so, please list it in the summary too.

Thanks for the review, updated the summary with an example.

Oct 01 '24 06:10 lnbest0707-uber

Why can't we use the includePaths config in Json index configuration option to explicitly index a certain path with any depth?

https://docs.pinot.apache.org/basics/indexing/json-index#enable-and-configure-a-json-index

Oct 02 '24 16:10 chenboat

Why can't we use the includePaths config in Json index configuration option to explicitly index a certain path with any depth?

https://docs.pinot.apache.org/basics/indexing/json-index#enable-and-configure-a-json-index

IIUC, the includePaths config is specified in the json index config of a field like

"json_data": {
      "disabled": false,
      "maxLevels": 3,
      "excludeArray": true,
      "disableCrossArrayUnnest": true,
      "maxValueLength": 1000,
     "includePaths": ["request", "response"],
      "skipInvalidJson": true
},

But the behavior will be only ["request", "response"] in json_data field could have json index, but other fields cannot have the json index. This is not what users might want as ["request", "response"] with 5 levels index but rest with 3 levels index.

Oct 02 '24 17:10 lnbest0707-uber

Abandon the change, will use https://github.com/apache/pinot/pull/14200 instead

Nov 26 '24 22:11 lnbest0707-uber

pinot
pinot copied to clipboard

[SchemaConformingTransformer]Add one more json_data field

Codecov Report

pinot pinot copied to clipboard

[SchemaConformingTransformer]Add one more json_data field

Codecov Report

pinot
pinot copied to clipboard