Add intermediate dataType to schema and use it for ingestion aggregation
Problem
Related to https://github.com/apache/pinot/issues/16317. TL;DR: When an ingestion aggregation or transformation reads a source column that is not present in the schema, data type conversion can throw exceptions, because the schema carries no type information for that column.
Example: for the ingestion aggregation sum(price), if the price column is not part of the schema, Pinot assumes it is a Number, but it can be a String in the source.
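To make the failure mode concrete, here is a minimal standalone sketch. It is not Pinot code: the class and method names are hypothetical, and it only assumes that the aggregator treats every source value as a Number, as described above.

```java
// Hypothetical sketch, not Pinot code: shows why assuming "Number" breaks
// when the source column actually carries strings.
class UntypedSumSketch {
  // Mimics an aggregator that assumes every source value is numeric.
  static double addAssumingNumber(double runningSum, Object sourceValue) {
    return runningSum + ((Number) sourceValue).doubleValue();
  }

  // Returns true if adding the value fails with a ClassCastException.
  static boolean failsOn(Object sourceValue) {
    try {
      addAssumingNumber(0.0, sourceValue);
      return false;
    } catch (ClassCastException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    System.out.println(failsOn(12.5));   // numeric source value: prints false
    System.out.println(failsOn("12.5")); // string source value: prints true
  }
}
```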
PR
Adds a new intermediate field type to the schema, as shown below, and uses this info in ingestion aggregation.
"intermediateFieldSpecs": [
{
"name": "price",
"dataType": "STRING"
}
],
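As a rough illustration of how a declared intermediate type could drive the conversion before aggregation, consider the sketch below. It is standalone Java, not the change made in this PR: the `DataType` enum, `toDouble`, and `sum` are hypothetical names, and falling back to DOUBLE for undeclared columns is an assumption made only for this example.

```java
import java.util.Map;

// Hypothetical sketch, not the PR's implementation.
class IntermediateTypeSketch {
  enum DataType { INT, LONG, FLOAT, DOUBLE, STRING }

  // Converts a raw source value to double using the declared type.
  static double toDouble(Object value, DataType declaredType) {
    switch (declaredType) {
      case STRING:
        // The source sends text, so parse it instead of casting.
        return Double.parseDouble((String) value);
      default:
        return ((Number) value).doubleValue();
    }
  }

  // Sums values of a source column, looking up its declared intermediate type.
  // Assumption: undeclared columns are treated as DOUBLE.
  static double sum(String column, Map<String, DataType> intermediateTypes, Object... values) {
    DataType type = intermediateTypes.getOrDefault(column, DataType.DOUBLE);
    double total = 0.0;
    for (Object value : values) {
      total += toDouble(value, type);
    }
    return total;
  }

  public static void main(String[] args) {
    // Mirrors the "price" / "STRING" entry from the schema snippet.
    Map<String, DataType> types = Map.of("price", DataType.STRING);
    System.out.println(sum("price", types, "10.5", "4.5")); // prints 15.0
  }
}
```

With the type declared, string values for price are parsed rather than cast, so the sum succeeds where the untyped path would throw.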
Pending
Adding more tests. Opening this PR to get early reviews.
Codecov Report
:x: Patch coverage is 53.74150% with 68 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 63.28%. Comparing base (2eeecc5) to head (a22c5cf).
Additional details and impacted files
@@ Coverage Diff @@
## master #16868 +/- ##
============================================
- Coverage 63.30% 63.28% -0.02%
Complexity 1474 1474
============================================
Files 3155 3157 +2
Lines 188119 188223 +104
Branches 28792 28805 +13
============================================
+ Hits 119088 119121 +33
- Misses 59800 59857 +57
- Partials 9231 9245 +14
| Flag | Coverage Δ | |
|---|---|---|
| custom-integration1 | 100.00% <ø> (ø) | |
| integration | 100.00% <ø> (ø) | |
| integration1 | 100.00% <ø> (ø) | |
| integration2 | 0.00% <ø> (ø) | |
| java-11 | 63.25% <53.74%> (-0.03%) | :arrow_down: |
| java-21 | 63.26% <53.74%> (+7.64%) | :arrow_up: |
| temurin | 63.28% <53.74%> (-0.02%) | :arrow_down: |
| unittests | 63.28% <53.74%> (-0.02%) | :arrow_down: |
| unittests1 | 55.61% <26.53%> (-0.05%) | :arrow_down: |
| unittests2 | 34.00% <51.70%> (+0.03%) | :arrow_up: |
@Jackie-Jiang Added an intermediate field spec to the schema, like:
"intermediateFieldSpecs": [
{
"name": "random",
"dataType": "STRING"
}
],
@noob-se7en
- Will it impact segment reload (due to the schema change), etc.?
- Its impact on existing segments: given that these are transformations at ingestion time, were we failing segment build for such scenarios (referring to the issues mentioned above)?
- Its impact on pauseless ingestion, i.e. scenarios of continued ingestion without segment build: will we rely on DR here?
- How are we handling transformations for such scenarios? Is the expectation that the column being transformed is part of the schema?
I guess that for transformations, the ingestion itself will throw exceptions at the row level, and we won't wait until the segment build?
> @noob-se7en
> - Will it impact segment reload (due to the schema change), etc.?
> - Its impact on existing segments: given that these are transformations at ingestion time, were we failing segment build for such scenarios (referring to the issues mentioned above)?
> - Its impact on pauseless ingestion, i.e. scenarios of continued ingestion without segment build: will we rely on DR here?
> - How are we handling transformations for such scenarios? Is the expectation that the column being transformed is part of the schema?
I don't fully understand the questions. The code changes are only in MutableSegmentImpl, so this should not impact segment reload, right?
This PR is only meant to support realtime ingestion aggregation (which happens during indexing of mutable segments).