
Fix large partition id for single stream

Open Jackie-Jiang opened this issue 3 months ago • 8 comments

When multiple stream support was added in #13790, it assumed the partition id for a given stream can never exceed 10000. This assumption does not hold for certain custom stream plugins.

#15957 fixed the problem by always checking if it is single stream before applying the modulo operation in IngestionConfigUtils (getStreamPartitionIdFromPinotPartitionId and getStreamConfigIndexFromPinotPartitionId).
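
The guard described above can be sketched roughly as follows. This is a simplified illustration, not the exact IngestionConfigUtils code; the signatures and the single-stream check are an approximation of the fix in #15957:

```java
// Simplified sketch of the single-stream guard from #15957.
// Not the exact Apache Pinot code; signatures are illustrative.
public class PartitionIdUtils {
    // Offset used to pad partition ids from different topics (see IngestionConfigUtils).
    public static final int PARTITION_PADDING_OFFSET = 10000;

    // Returns the partition id within the underlying stream.
    public static int getStreamPartitionIdFromPinotPartitionId(int pinotPartitionId, int numStreams) {
        // Single-stream guard: skip the modulo so large partition ids survive unchanged.
        if (numStreams == 1) {
            return pinotPartitionId;
        }
        return pinotPartitionId % PARTITION_PADDING_OFFSET;
    }

    // Returns the index of the stream config the partition belongs to.
    public static int getStreamConfigIndexFromPinotPartitionId(int pinotPartitionId, int numStreams) {
        if (numStreams == 1) {
            return 0;
        }
        return pinotPartitionId / PARTITION_PADDING_OFFSET;
    }

    public static void main(String[] args) {
        // Single stream: a partition id above 10000 is preserved.
        System.out.println(getStreamPartitionIdFromPinotPartitionId(123456, 1));
        // Two streams: Pinot partition id 10003 maps to partition 3 of the second stream.
        System.out.println(getStreamPartitionIdFromPinotPartitionId(10003, 2));
    }
}
```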

Several recent PRs break this again, and we need to fix it:

  • #16783 @noob-se7en
  • #16833 @lnbest0707-uber
  • #17163 @noob-se7en
  • #17217 @tarun11Mavani

In the meantime, we should consider adding a test for this scenario.

@jadami10 Please confirm whether you are still using a plugin that can generate large partition ids

Jackie-Jiang avatar Nov 22 '25 01:11 Jackie-Jiang

See #17261 for example of fix

Jackie-Jiang avatar Nov 22 '25 02:11 Jackie-Jiang

Yes, our plugin uses the full 32-bit range, so we have partition ids much greater than 10,000.

Cc @priyen-stripe since you've been looking at an upgrade.

jadami10 avatar Nov 22 '25 20:11 jadami10

@tarun11Mavani: can you review this and confirm whether our internal use case can continue to work on master?

ankitsultana avatar Nov 26 '25 02:11 ankitsultana

Can we also add a test scenario for topic data migration? This also happens often in production, e.g. migrating a topic from a large number of partitions to a smaller one.

Assume we are switching the Kafka topic from topicA to topicB:

  0. Create both Kafka topics, topicA (128 partitions) and topicB (64 partitions); the table is initially set up with topicA.

For a seamless migration with no downtime and no data loss:

  1. Add Kafka topicB to the table stream config; ensure it exists but has no data coming in yet
  2. Cut the Kafka producer over from topicA to topicB
  3. Wait for topicA's data to be persisted in the table
  4. Remove topicA from the stream configs
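
A sketch of what steps 0-1 might look like in the table's ingestion config, assuming the multi-topic streamConfigMaps layout; broker addresses and most consumer properties are omitted for brevity:

```json
{
  "ingestionConfig": {
    "streamIngestionConfig": {
      "streamConfigMaps": [
        {
          "streamType": "kafka",
          "stream.kafka.topic.name": "topicA"
        },
        {
          "streamType": "kafka",
          "stream.kafka.topic.name": "topicB"
        }
      ]
    }
  }
}
```

Step 4 would then reduce streamConfigMaps back to a single entry for topicB.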

xiangfu0 avatar Dec 01 '25 00:12 xiangfu0

When multiple stream support was added in https://github.com/apache/pinot/pull/13790, it assumed the partition id for a given stream can never exceed 10000. This assumption does not hold for certain custom stream plugins.

I was unaware that the above assumption was wrong while creating the follow up PRs and went ahead based on this comment:

https://github.com/apache/pinot/blame/8bfb907db9eeb29ea3f494f435aa2fa781378db1/pinot-spi/src/main/java/org/apache/pinot/spi/utils/IngestionConfigUtils.java#L58-L60

  // For partitions from different topics, we pad them with an offset to avoid collision. The offset is far higher
  // than the normal max number of partitions on stream (e.g. 512).
  public static final int PARTITION_PADDING_OFFSET = 10000;
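
To make the failure mode concrete: with the padding scheme, a Pinot partition id is decoded as stream index = id / 10000 and stream partition id = id % 10000, so a real partition id at or above 10000 is misread. A minimal arithmetic illustration with a hypothetical value:

```java
// Illustrates why the 10000 padding assumption breaks for large partition ids.
public class PaddingCollisionDemo {
    static final int PARTITION_PADDING_OFFSET = 10000;

    static int decodedStreamIndex(int pinotPartitionId) {
        return pinotPartitionId / PARTITION_PADDING_OFFSET;
    }

    static int decodedStreamPartitionId(int pinotPartitionId) {
        return pinotPartitionId % PARTITION_PADDING_OFFSET;
    }

    public static void main(String[] args) {
        // A custom plugin emits partition id 34567 for its only stream.
        // The padding scheme misreads it as stream index 3, partition 4567.
        System.out.println(decodedStreamIndex(34567));       // 3
        System.out.println(decodedStreamPartitionId(34567)); // 4567
    }
}
```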

https://github.com/apache/pinot/pull/15957 fixed the problem by always checking if it is a single stream before applying the modulo operation in IngestionConfigUtils.

But this does not fix the actual problem right?

I am not sure if the padding approach is the best. Considering that the issue is with the LLCSegmentName naming convention here, maybe we should explore a different naming strategy for multi-stream tables.
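
For context on the naming convention mentioned above: an LLC segment name embeds the partition group id directly, which is how the padded id leaks into segment names. A simplified sketch of the format (the real logic lives in Pinot's LLCSegmentName class; the example name here is hypothetical):

```java
// Simplified sketch of the LLC segment name layout:
// tableName__partitionGroupId__sequenceNumber__creationTime
public class LlcSegmentNameSketch {
    public static String build(String tableName, int partitionGroupId, int sequenceNumber, String creationTime) {
        return tableName + "__" + partitionGroupId + "__" + sequenceNumber + "__" + creationTime;
    }

    public static int extractPartitionGroupId(String segmentName) {
        // Assumes the table name itself contains no "__" separator.
        String[] parts = segmentName.split("__");
        return Integer.parseInt(parts[1]);
    }

    public static void main(String[] args) {
        // With multi-stream padding, stream index 1 / partition 3 becomes group id 10003,
        // which ends up embedded in the segment name.
        String name = build("myTable", 10003, 42, "20251201T0000Z");
        System.out.println(name);
    }
}
```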

noob-se7en avatar Dec 03 '25 20:12 noob-se7en

But this does not fix the actual problem right?

Agree. This is just a workaround to avoid breaking the single-stream case, which is much more common than multiple streams.

I am not sure if the padding approach is the best. Considering that the issue is with the LLCSegmentName naming convention here, maybe we should explore a different naming strategy for multi-stream tables.

+1 on this. The root problem is overloading the partition id to represent multiple streams. cc @lnbest0707-uber @ankitsultana Given the above problems, what's your opinion?

Jackie-Jiang avatar Dec 09 '25 00:12 Jackie-Jiang

But this does not fix the actual problem right?

Agree. This is just a workaround to avoid breaking the single-stream case, which is much more common than multiple streams.

I am not sure if the padding approach is the best. Considering that the issue is with the LLCSegmentName naming convention here, maybe we should explore a different naming strategy for multi-stream tables.

+1 on this. The root problem is overloading the partition id to represent multiple streams. cc @lnbest0707-uber @ankitsultana Given the above problems, what's your opinion?

There was one proposal to add the topic name into the segment name, but the community was a bit hesitant about it. On one side, it introduces many changes to the existing code. On the other side, it may increase the length of segment names and the size of the ideal state. The ideal state especially could become very large with uncontrolled Kafka topic names. Maybe the solution could be slightly changing the segment naming format while keeping only the index of the topic.

lnbest0707 avatar Dec 09 '25 04:12 lnbest0707

Yeah, I was thinking something similar: storing all the topic names of the table in ZK metadata as a list and using the index of those topic names when generating the segment name.
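
The idea could look roughly like this. Purely illustrative: the class, the persisted topic list, and the extra name token are hypothetical, not an agreed design:

```java
import java.util.List;

// Hypothetical sketch of the proposal: derive a stable topic index from a list
// persisted in the table's ZK metadata, instead of padding the partition id.
public class TopicIndexNamingSketch {
    // Ordered list of topic names, assumed to be stored in ZK metadata.
    private final List<String> topicNames;

    public TopicIndexNamingSketch(List<String> topicNames) {
        this.topicNames = topicNames;
    }

    public String buildSegmentName(String tableName, String topicName, int streamPartitionId,
                                   int sequenceNumber, String creationTime) {
        int topicIndex = topicNames.indexOf(topicName);
        // Encode the topic index and the raw partition id as separate tokens,
        // so the partition id is no longer bounded by a padding offset.
        return tableName + "__" + topicIndex + "__" + streamPartitionId + "__" + sequenceNumber + "__" + creationTime;
    }

    public static void main(String[] args) {
        TopicIndexNamingSketch naming = new TopicIndexNamingSketch(List.of("topicA", "topicB"));
        // A large partition id from a custom plugin survives unchanged.
        System.out.println(naming.buildSegmentName("myTable", "topicB", 123456, 0, "20251209T0000Z"));
    }
}
```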

noob-se7en avatar Dec 09 '25 18:12 noob-se7en