Remediate ingestion failures when the number of segments in a time period is larger than 32767
Fixes https://github.com/apache/druid/issues/15091.
Description
This PR attempts to remediate the exception `java.lang.IllegalArgumentException: fromKey > toKey`, which is thrown when the number of segments in a time period exceeds Java's Short.MAX_VALUE (32767). I don't have full context on why the number of segments per time period is limited to the range of a Java short, but I believe this change would let ingestion keep going.
I can send a different PR if it is more appropriate to change the segment count to the range of an int instead of a short, which requires a larger scope of changes.
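For context, here is a minimal, self-contained sketch (not Druid's actual code; the class and variable names are illustrative) of how a partition id kept in a Java short wraps to a negative value past 32767 and produces exactly this `fromKey > toKey` failure in a sorted-map range lookup:

```java
import java.util.TreeMap;

// Illustrative only: shows how a short-typed partition id wrapping negative
// breaks a range lookup over segments keyed by partition id.
public class ShortPartitionOverflowDemo
{
  public static void main(String[] args)
  {
    TreeMap<Short, String> segmentsByPartition = new TreeMap<>();
    segmentsByPartition.put((short) 0, "segment-0");

    int endPartitionId = 32768;            // one past Short.MAX_VALUE
    short fromKey = (short) 32760;         // still a valid positive short
    short toKey = (short) endPartitionId;  // wraps around to -32768

    // TreeMap.subMap requires fromKey <= toKey; the wrapped negative upper bound
    // makes fromKey > toKey and throws IllegalArgumentException: fromKey > toKey.
    segmentsByPartition.subMap(fromKey, toKey);
  }
}
```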
Fixed the bug: https://github.com/apache/druid/issues/15091.
Renamed the class: None.
Added a forbidden-apis entry ...: None.
Release note
Prevent ingestion failures when the number of segments in a time period exceeds 32767
Key changed/added classes in this PR
`OvershadowableManager`
This PR has:
- [x] been self-reviewed.
- [ ] added documentation for new or modified features or behaviors.
- [x] a release note entry in the PR description.
- [x] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
- [ ] added or updated version, license, or notice information in licenses.yaml
- [x] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
- [x] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
- [ ] added integration tests.
- [ ] been tested in a test Druid cluster.
Let me know if it is appropriate to refactor partitionId to int instead of short, which I believe would solve this problem more completely.
Hi @kfaraz @abhishekagarwal87, sorry for being verbose on this thread. I hope I have described our issue well; please let me know your thoughts. I'm open to any solution that can keep our ingestion alive without deleting historical data.
@dulu98Kurz Thank you for your patience. Since most of the committers are busy with Druid 28 work, there might be some delay.
I think we have an ugly fail-safe here. What I would prefer is to change the exception message to something nicer so that the user does not need to read through Druid's code base.
Like @abhishekagarwal87 said, 32K partitions per interval just seems massive and is generally a symptom of compaction lagging behind or late data arriving. If we change this variable to int, the ticking time bomb has a larger impact and can lead to a full cluster outage where the number of segments in the cluster balloons up into the millions.
I think if we do want to change this to int, we should add a guard rail on the number of partitions per interval, raising the implicit 32K limit to something larger, maybe 50K?
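For illustration, a minimal sketch of the guard rail idea suggested here, assuming a hypothetical `validatePartitionId` helper and a 50K cap; this is not an existing Druid API, just the shape of the check and the friendlier error message:

```java
// Hypothetical guard rail: fail fast with an actionable message instead of
// surfacing a raw "fromKey > toKey" from deep inside the code base.
public class PartitionGuardRail
{
  // A cap larger than the old implicit Short.MAX_VALUE limit, e.g. 50K.
  private static final int MAX_PARTITIONS_PER_INTERVAL = 50_000;

  static void validatePartitionId(int partitionId, String interval)
  {
    if (partitionId < 0 || partitionId >= MAX_PARTITIONS_PER_INTERVAL) {
      throw new IllegalStateException(
          String.format(
              "Too many segments [%d] in interval [%s]; the limit is [%d]. "
              + "Consider enabling or tuning compaction, or using a coarser segment granularity.",
              partitionId + 1,
              interval,
              MAX_PARTITIONS_PER_INTERVAL
          )
      );
    }
  }
}
```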
@cryptoe thanks for checking on this! Agreed, letting the segment count grow unchecked toward Integer.MAX_VALUE would definitely accumulate a bigger problem.
I'll update the PR to include a cap of 50K, or 65536 if we want to go by binary convention, and also make the exception message more meaningful. Will send an update soon; thanks again for the attention! @cryptoe
Hi @cryptoe @abhishekagarwal87 @kfaraz, please find the updates in PR https://github.com/apache/druid/pull/15116; it includes the changes discussed:
- More informative exception messages.
- Refactored partitionId in `RootPartitionRange` from `short` to `int`, but capped at 65536 to keep memory pressure under control (see the sketch below).
- Removed an unnecessary, problematic `short`-to-`int` conversion.
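A hedged sketch of the refactor described in the list above (not the actual diff in https://github.com/apache/druid/pull/15116; the class name `IntPartitionRange` is hypothetical), showing partition bounds stored as int while still enforcing an explicit 65536 cap:

```java
// Illustrative only: partition bounds widened from short to int, with an
// explicit cap so the per-interval segment count cannot grow without bound.
class IntPartitionRange
{
  static final int MAX_PARTITION_ID = 65_536;

  final int startPartitionId;  // inclusive
  final int endPartitionId;    // exclusive

  IntPartitionRange(int startPartitionId, int endPartitionId)
  {
    if (startPartitionId < 0 || endPartitionId > MAX_PARTITION_ID || startPartitionId > endPartitionId) {
      throw new IllegalArgumentException(
          "Invalid partition range [" + startPartitionId + ", " + endPartitionId
          + "); partition ids must stay within [0, " + MAX_PARTITION_ID + "]"
      );
    }
    this.startPartitionId = startPartitionId;
    this.endPartitionId = endPartitionId;
  }
}
```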
Please let me know if there are any concerns; I'll close this PR once https://github.com/apache/druid/pull/15116 is merged.
This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the [email protected] list. Thank you for your contributions.
This pull request/issue has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.