
Remediate ingestion failures when the number of segments in a time period is larger than 32767

Open · dulu98Kurz opened this pull request 2 years ago • 6 comments

Fixes https://github.com/apache/druid/issues/15091.

Description

This PR attempts to remediate the exception java.lang.IllegalArgumentException: fromKey > toKey, which is thrown when the number of segments in a time period is larger than Java's Short.MAX_VALUE (32767). I don't have full context on why the number of segments per time period was limited to the range of a Java short, but this is the change I believe could keep ingestion going. I can send a different PR if it is more appropriate to change the segment count to the range of int instead of short, which requires a larger scope of changes.
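For illustration, here is a minimal standalone Java sketch of the failure mechanism (hypothetical class and variable names, not actual Druid code): once an interval holds more than Short.MAX_VALUE segments, the narrowing cast to short wraps negative, and a range lookup then sees an inverted range.

```java
// Minimal sketch (not actual Druid code) of how a short-typed partition id
// wraps once an interval exceeds Short.MAX_VALUE segments, producing the
// "fromKey > toKey" failure reported in issue #15091.
public class ShortPartitionIdOverflow
{
    public static void main(String[] args)
    {
        int segmentCount = 32_768;           // one past Short.MAX_VALUE
        short fromKey = 0;
        short toKey = (short) segmentCount;  // narrowing cast wraps to -32768

        System.out.println("fromKey=" + fromKey + ", toKey=" + toKey);

        // A sorted map keyed on these values rejects the inverted range; for
        // example, TreeMap.subMap(fromKey, toKey) throws
        // java.lang.IllegalArgumentException: fromKey > toKey.
        if (fromKey > toKey) {
            throw new IllegalArgumentException("fromKey > toKey");
        }
    }
}
```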

Fixed the bug

https://github.com/apache/druid/issues/15091.

Renamed the class

None

Added a forbidden-apis entry ...

None

Release note

Prevent ingestion failures when the number of segments in a time period exceeds 32767


Key changed/added classes in this PR
  • OvershadowableManager

This PR has:

  • [x] been self-reviewed.
  • [ ] added documentation for new or modified features or behaviors.
  • [x] a release note entry in the PR description.
  • [x] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • [ ] added or updated version, license, or notice information in licenses.yaml
  • [x] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • [x] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • [ ] added integration tests.
  • [ ] been tested in a test Druid cluster.

dulu98Kurz avatar Oct 04 '23 17:10 dulu98Kurz

Let me know if it is appropriate to refactor partitionId to int instead of short, which I believe would solve this problem more completely.

dulu98Kurz avatar Oct 04 '23 21:10 dulu98Kurz

Hi @kfaraz @abhishekagarwal87, sorry for being verbose on the thread. I hope I have described our issue well; please let me know your thoughts. I'm open to any solution that can keep our ingestion alive without deleting historical data.

dulu98Kurz avatar Oct 16 '23 06:10 dulu98Kurz

@dulu98Kurz Thank you for your patience. Since most of the committers are busy with Druid 28 work, there might be some delay.

I think we have an ugly fail-safe here. What I would prefer is to change the exception message to something nicer, so that the user does not need to read through Druid's code base.

Like @abhishekagarwal87 said, 32K partitions per interval just seems too massive and is generally a sign that either compaction is lagging behind or late data is arriving. If we change this variable to int, the ticking time bomb has a larger impact and can lead to a full cluster outage where the number of segments in the cluster balloons into the millions.

I think if we do want to change this to int, we should add a guardrail on the number of partitions per interval, raising the implicit 32K limit to something larger, maybe 50K?

cryptoe avatar Oct 19 '23 15:10 cryptoe
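As a side note on the suggestion above, a hypothetical sketch of the friendlier failure mode: validate at the short boundary and explain the likely cause, instead of surfacing the raw fromKey > toKey error from the range map. Names here are illustrative, not the actual Druid implementation.

```java
// Illustrative guardrail only; not the actual Druid code path.
public class PartitionIdGuardRail
{
    static short checkedShortPartitionId(int partitionId)
    {
        if (partitionId > Short.MAX_VALUE) {
            throw new IllegalStateException(
                "Cannot add more than " + Short.MAX_VALUE + " segments to a single interval. "
                + "This usually means compaction is lagging behind or late data keeps "
                + "creating new partitions; consider compacting the interval."
            );
        }
        return (short) partitionId; // safe: value is within short range
    }

    public static void main(String[] args)
    {
        System.out.println(checkedShortPartitionId(100)); // fine
        checkedShortPartitionId(40_000);                  // fails with an actionable message
    }
}
```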

@cryptoe thanks for checking on this! Agreed, letting the segment count grow unchecked toward Integer.MAX_VALUE would definitely accumulate a bigger problem.

I'll update the PR to include a cap of 50K, or 65536 if we want to go by binary convention, and to make the exception message more meaningful. I will send an update soon; thanks again for the attention, @cryptoe!

dulu98Kurz avatar Oct 20 '23 01:10 dulu98Kurz

Hi @cryptoe @abhishekagarwal87 @kfaraz, please find the updates in the PR (https://github.com/apache/druid/pull/15116). It includes the changes discussed above:

  1. More informative exception language.
  2. Refactored partitionId in RootPartitionRange from short to int, but with a maximum value of 65536 to keep memory pressure under control (see the sketch below).
  3. Removed an unnecessary, problematic short-to-int conversion.
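A hypothetical sketch of change 2, assuming the cap and naming from the discussion above (RootPartitionRange, the 65536 limit); this is illustrative, not the actual code from PR https://github.com/apache/druid/pull/15116:

```java
// Hypothetical sketch: partition id range widened from short to int, capped
// at 65536 per interval, with a more informative failure message.
public class RootPartitionRangeSketch
{
    static final int MAX_PARTITIONS_PER_INTERVAL = 65_536;

    final int startPartitionId; // previously short
    final int endPartitionId;   // previously short

    RootPartitionRangeSketch(int startPartitionId, int endPartitionId)
    {
        if (startPartitionId > endPartitionId) {
            throw new IllegalArgumentException(
                "startPartitionId[" + startPartitionId
                + "] must not be greater than endPartitionId[" + endPartitionId + "]");
        }
        if (endPartitionId > MAX_PARTITIONS_PER_INTERVAL) {
            throw new IllegalStateException(
                "Number of segments in the interval exceeds the maximum of "
                + MAX_PARTITIONS_PER_INTERVAL
                + "; compact the interval before ingesting more data.");
        }
        this.startPartitionId = startPartitionId;
        this.endPartitionId = endPartitionId;
    }
}
```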

Please let me know if there are any concerns. I'll close this PR once https://github.com/apache/druid/pull/15116 is merged.

dulu98Kurz avatar Oct 21 '23 08:10 dulu98Kurz

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

github-actions[bot] avatar Mar 07 '24 00:03 github-actions[bot]

This pull request/issue has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

github-actions[bot] avatar Apr 05 '24 00:04 github-actions[bot]