pinot icon indicating copy to clipboard operation
pinot copied to clipboard

Add multi stream ingestion support

Open lnbest0707 opened this issue 1 year ago • 1 comments

feature Reference: https://github.com/apache/pinot/issues/13780 Design Doc

Please refer to design doc for details. TLDR:

  • Add support to ingest from multiple source by a single table
  • Use existing interface (TableConfig) to define multiple streams
  • Separate the partition id definition between Stream and Pinot segment
  • Compatible with existing stream partition auto expansion logics

Feature tested on multiple Kafka topics with different decoder format. Due to resource limitations, not able to test other upstream source e2e.

Some TODOs:

  • Validation and Limitation on multiple stream configs.
  • Standardize the usage of StreamConfig object. e.g. some are only used to get non-topic related static metadata, should use other interface.
  • Adding/removing stream support or sanity check.

lnbest0707 avatar Aug 09 '24 23:08 lnbest0707

Codecov Report

Attention: Patch coverage is 63.21839% with 64 lines in your changes missing coverage. Please review.

Project coverage is 64.03%. Comparing base (59551e4) to head (d8f46da). Report is 1483 commits behind head on master.

Files with missing lines Patch % Lines
...g/apache/pinot/spi/utils/IngestionConfigUtils.java 53.33% 11 Missing and 3 partials :warning:
...inot/spi/stream/PartitionGroupMetadataFetcher.java 65.71% 11 Missing and 1 partial :warning:
.../core/realtime/PinotLLCRealtimeSegmentManager.java 84.21% 9 Missing :warning:
...apache/pinot/controller/BaseControllerStarter.java 0.00% 5 Missing :warning:
...x/core/realtime/MissingConsumingSegmentFinder.java 44.44% 5 Missing :warning:
...r/validation/RealtimeSegmentValidationManager.java 0.00% 5 Missing :warning:
...he/pinot/segment/local/utils/TableConfigUtils.java 50.00% 3 Missing and 2 partials :warning:
...roller/helix/core/PinotTableIdealStateBuilder.java 0.00% 3 Missing :warning:
.../helix/core/realtime/SegmentCompletionManager.java 0.00% 2 Missing :warning:
...a/manager/realtime/RealtimeSegmentDataManager.java 87.50% 1 Missing :warning:
... and 3 more
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #13790      +/-   ##
============================================
+ Coverage     61.75%   64.03%   +2.27%     
- Complexity      207     1605    +1398     
============================================
  Files          2436     2703     +267     
  Lines        133233   149053   +15820     
  Branches      20636    22849    +2213     
============================================
+ Hits          82274    95439   +13165     
- Misses        44911    46620    +1709     
- Partials       6048     6994     +946     
Flag Coverage Δ
custom-integration1 100.00% <ø> (+99.99%) :arrow_up:
integration 100.00% <ø> (+99.99%) :arrow_up:
integration1 100.00% <ø> (+99.99%) :arrow_up:
integration2 0.00% <ø> (ø)
java-11 63.95% <63.21%> (+2.24%) :arrow_up:
java-21 63.92% <63.21%> (+2.30%) :arrow_up:
skip-bytebuffers-false 63.97% <63.21%> (+2.22%) :arrow_up:
skip-bytebuffers-true 63.90% <63.21%> (+36.17%) :arrow_up:
temurin 64.03% <63.21%> (+2.27%) :arrow_up:
unittests 64.02% <63.21%> (+2.27%) :arrow_up:
unittests1 56.29% <33.70%> (+9.40%) :arrow_up:
unittests2 34.47% <52.29%> (+6.74%) :arrow_up:

Flags with carried forward coverage won't be shown. Click here to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov-commenter avatar Aug 10 '24 00:08 codecov-commenter

Add the production running report: The feature has been running in Uber production environment with PBs of data for months. There are hundreds of Pinot tables created. One table can be created with 20-30+ topics ingested with no issues. The overall ingestion and query performance is also competitive with the common single topic ingestions.

Notes: the feature is running with Kafka's multiple topics ingestion. We do not have resources to run it with other or multiple type of streams.

lnbest0707 avatar Nov 14 '24 00:11 lnbest0707

Thanks @itschrispeck for identifying a edge case issue and proposing the fix in the commit https://github.com/apache/pinot/pull/13790/commits/6bd1307c9249ce73232865cfe4edcc967b058111 to address the missing segments issue if cannot fetch partition metadata from the stream.

lnbest0707 avatar Nov 27 '24 01:11 lnbest0707

@Jackie-Jiang could you pls review again and see if we still have blockers before merging? Thanks

lnbest0707 avatar Dec 18 '24 02:12 lnbest0707

Thanks again for contributing this feature. Is there a user doc associated with this feature

kishoreg avatar Dec 18 '24 03:12 kishoreg

Thanks again for contributing this feature. Is there a user doc associated with this feature

@kishoreg thanks for bringing this up. After the PR merged, I would update the pinot-doc with the feature and its usage. In general, users could use the exact same way to define the table config with existing interfaces. I would provide an example in PR description.

lnbest0707 avatar Dec 18 '24 18:12 lnbest0707