pinot icon indicating copy to clipboard operation
pinot copied to clipboard

[SchemaConformingTransformer]Add one more json_data field

Open lnbest0707-uber opened this issue 1 year ago • 5 comments

feature ingestion To ingest json-like data through SchemaConformingTransformer, previously, we could only dump data to either dedicated column or a single json_data field. We may build json index on json_data field to serve queries. However, in reality, users may e.g. dump data like request/response body to the json data and query them. For those special json field, they may want special treatment like more levels of json index. With this patch, among the non-dedicated column data, we will be able to dump them to either default field or a special field filtered by its prefix. And in real use cases, users could choose to build more index or add other special treatments to it.

One example, user might set 3 levels json on the default json_data field but needs some part of json (e.g. request and response) has deeper level index, e.g. 5. Then they could set indexableSpecialFieldPrefix: ["request", "response"] in the SchemaConformingTransformerV2 config, and add json_data_inf_idx as json index in the config with 5 levels.

"schemaConformingTransformerV2Config": {
        "enableIndexableExtras": true,
        "indexableExtrasField": "json_data",
        "enableUnindexableExtras": true,
        "indexableSpecialFieldPrefix": ["request", "response"],
        "indexableSpecialField": "json_data_inf_idx",
        "unindexableExtrasField": "json_data_no_idx",
        "unindexableFieldSuffix": "_noindex",
        "fieldPathsToDrop": [],
        "fieldPathsToSkipStorage": [
          "message"
        ],
        "columnNameToJsonKeyPathMap": {},
        "mergedTextIndexFields": [
          "__mergedTextIndex"
        ],
        "optimizeCaseInsensitiveSearch": false,
        "reverseTextIndexKeyValueOrder": true,
        "mergedTextIndexDocumentMaxLength": 32766,
        "mergedTextIndexBinaryDocumentDetectionMinLength": 512,
        "mergedTextIndexPathToExclude": [
          "_timestampMillisNegative",
          "__mergedTextIndex",
          "_timestampMillis"
        ],
        "fieldsToDoubleIngest": [],
        "jsonKeyValueSeparator": "\u001e",
        "mergedTextIndexBeginOfDocAnchor": "\u0002",
        "mergedTextIndexEndOfDocAnchor": "\u0003",
        "fieldPathsToPreserveInput": [],
        "fieldPathsToPreserveInputWithIndex": []
      },
"jsonIndexConfigs": {
        "json_data": {
          "disabled": false,
          "maxLevels": 3,
          "excludeArray": true,
          "disableCrossArrayUnnest": true,
          "maxValueLength": 1000,
          "skipInvalidJson": true
        },
      "json_data_inf_idx": {
          "disabled": false,
          "maxLevels": 5,
          "excludeArray": true,
          "disableCrossArrayUnnest": true,
          "maxValueLength": 1000,
          "skipInvalidJson": true
        }
      },

lnbest0707-uber avatar Sep 26 '24 22:09 lnbest0707-uber

Codecov Report

Attention: Patch coverage is 70.45455% with 13 lines in your changes missing coverage. Please review.

Project coverage is 63.98%. Comparing base (59551e4) to head (ef3cd01). Report is 1094 commits behind head on master.

Files with missing lines Patch % Lines
...ingestion/SchemaConformingTransformerV2Config.java 0.00% 11 Missing :warning:
...recordtransformer/SchemaConformingTransformer.java 96.29% 0 Missing and 1 partial :warning:
...cordtransformer/SchemaConformingTransformerV2.java 83.33% 1 Missing :warning:
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #14097      +/-   ##
============================================
+ Coverage     61.75%   63.98%   +2.23%     
- Complexity      207     1535    +1328     
============================================
  Files          2436     2596     +160     
  Lines        133233   143330   +10097     
  Branches      20636    21956    +1320     
============================================
+ Hits          82274    91714    +9440     
+ Misses        44911    44866      -45     
- Partials       6048     6750     +702     
Flag Coverage Δ
custom-integration1 100.00% <ø> (+99.99%) :arrow_up:
integration 100.00% <ø> (+99.99%) :arrow_up:
integration1 100.00% <ø> (+99.99%) :arrow_up:
integration2 0.00% <ø> (ø)
java-11 63.96% <70.45%> (+2.25%) :arrow_up:
java-21 63.87% <70.45%> (+2.24%) :arrow_up:
skip-bytebuffers-false 63.98% <70.45%> (+2.23%) :arrow_up:
skip-bytebuffers-true 63.83% <70.45%> (+36.11%) :arrow_up:
temurin 63.98% <70.45%> (+2.23%) :arrow_up:
unittests 63.98% <70.45%> (+2.23%) :arrow_up:
unittests1 55.55% <0.00%> (+8.65%) :arrow_up:
unittests2 34.48% <70.45%> (+6.75%) :arrow_up:

Flags with carried forward coverage won't be shown. Click here to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov-commenter avatar Sep 26 '24 22:09 codecov-commenter

However, in reality, users may e.g. dump data like request/response body to the json data and query them. For those special json field, they may want special treatment like more levels of json index.

Can you add one example with data to illustrate the use case?

With this patch, among the non-dedicated column data, we will be able to dump them to either default field or a special field filtered by its prefix. And in real use cases, users could choose to build more index or add other special treatments to it.

One example can be added to explain the new feature added in the PR summary. Are you also proposing another new config to the transfer? If so, please list it in the summary too.

chenboat avatar Oct 01 '24 03:10 chenboat

However, in reality, users may e.g. dump data like request/response body to the json data and query them. For those special json field, they may want special treatment like more levels of json index.

Can you add one example with data to illustrate the use case?

With this patch, among the non-dedicated column data, we will be able to dump them to either default field or a special field filtered by its prefix. And in real use cases, users could choose to build more index or add other special treatments to it.

One example can be added to explain the new feature added in the PR summary. Are you also proposing another new config to the transfer? If so, please list it in the summary too.

Thanks for the review, updated the summary with an example.

lnbest0707-uber avatar Oct 01 '24 06:10 lnbest0707-uber

Why can't we use the includePaths config in Json index configuration option to explicitly index a certain path with any depth?

https://docs.pinot.apache.org/basics/indexing/json-index#enable-and-configure-a-json-index

chenboat avatar Oct 02 '24 16:10 chenboat

Why can't we use the includePaths config in Json index configuration option to explicitly index a certain path with any depth?

https://docs.pinot.apache.org/basics/indexing/json-index#enable-and-configure-a-json-index

IIUC, the includePaths config is specified in the json index config of a field like

"json_data": {
      "disabled": false,
      "maxLevels": 3,
      "excludeArray": true,
      "disableCrossArrayUnnest": true,
      "maxValueLength": 1000,
     "includePaths": ["request", "response"],
      "skipInvalidJson": true
},

But the behavior will be only ["request", "response"] in json_data field could have json index, but other fields cannot have the json index. This is not what users might want as ["request", "response"] with 5 levels index but rest with 3 levels index.

lnbest0707-uber avatar Oct 02 '24 17:10 lnbest0707-uber

Abandon the change, will use https://github.com/apache/pinot/pull/14200 instead

lnbest0707-uber avatar Nov 26 '24 22:11 lnbest0707-uber