solr
solr copied to clipboard
SOLR-16403: Cluster Singleton to remove inactive Shards
https://issues.apache.org/jira/browse/SOLR-16403
Description
This PR proposes a configurable cluster singleton plugin to remove inactive shards. I'd like to get some feedback from the community on whether this is a reasonable thing to do, or if there are other preferred alternatives.
Solution
Adds a Cluster Singleton plugin implementation that can be configured to run periodically and delete inactive shards that have exceeded a TTL period.
Tests
Tests are included. Coverage may be expanded if we move forward
Checklist
Please review the following and check all that apply:
- [x] I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
- [x] I have created a Jira issue and added the issue ID to my pull request title.
- [x] I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
- [x] I have developed this patch against the
mainbranch. - [x] I have run
./gradlew check. - [x] I have added tests for my changes.
- [ ] I have added documentation for the Reference Guide
PR Run Status
- AI Based Review: Successful
- Static Analysis: Failure - Failed to execute static code analysis using FBinfer.
PR Analysis
- Main theme: Implementation of a Cluster Singleton plugin for removing inactive shards in Solr.
- PR summary: This PR introduces a Cluster Singleton plugin, InactiveShardRemover, aimed at identifying and removing inactive shards that have exceeded a Time-To-Live (TTL) period. It configures a scheduled task to periodically check and delete these shards, thereby helping in maintaining the health and efficiency of the Solr cluster.
- Type of PR: Enhancement
- Score (0-100, higher is better): 85
- Relevant tests added: Yes
- Estimated effort to review (1-5, lower is better): 2 The PR is well-structured with a clear purpose. It includes tests which should make the review process straightforward. However, the complexity of the changes and the potential impact on cluster behavior necessitate a thorough review.
PR Feedback
- General suggestions: The implementation of the InactiveShardRemover as a Cluster Singleton for removing inactive shards after a TTL is a valuable addition to Solr's maintenance capabilities. The approach of scheduling periodic checks to remove expired shards can help in keeping the cluster lean and efficient. However, it's crucial to ensure that the removal process does not inadvertently affect the cluster's stability or data integrity. To further enhance this feature, consider implementing safety checks that prevent the removal of shards if they are temporarily marked as inactive due to network partitions or other transient issues. Additionally, providing more configurability for the TTL and deletion criteria could allow administrators to tailor the feature to their specific needs. Finally, extensive testing in simulated production environments is recommended to ensure that the feature behaves as expected under various failure scenarios.
Code feedback
file: solr/core/src/java/org/apache/solr/cluster/maintenance/InactiveShardRemover.java
- Suggestions:
- Consider implementing a dry-run mode that logs which shards would be deleted without actually performing the deletion. This can be useful for administrators to safely test the impact of the InactiveShardRemover on their cluster. [important] Relevant line:In solr/core/src/java/org/apache/solr/cluster/maintenance/InactiveShardRemover.java at line 85
+ implements ClusterSingleton, ConfigurablePlugin<InactiveShardRemoverConfig> {
- Add metrics or logging to track the number of shards checked and deleted per cycle. This can help in monitoring the effectiveness of the plugin and troubleshooting issues. [important] Relevant line:In solr/core/src/java/org/apache/solr/cluster/maintenance/InactiveShardRemover.java at line 176
+ inactiveSlices.stream().limit(maxDeletesPerCycle).forEach(this::deleteShard);
- Ensure that the scheduled executor service is properly managed to avoid potential memory leaks. Consider using a try-with-resources statement or similar mechanism to ensure that resources are freed up. [important] Relevant line:In solr/core/src/java/org/apache/solr/cluster/maintenance/InactiveShardRemover.java at line 132
+ executor = Executors.newScheduledThreadPool(1, new SolrNamedThreadFactory(PLUGIN_NAME));
- Consider adding a feature to notify administrators before deleting shards, possibly through email or a dashboard notification. This could provide an additional layer of safety before data deletion occurs. [medium] Relevant line:In solr/core/src/java/org/apache/solr/cluster/maintenance/InactiveShardRemover.java at line 190
+ private void deleteShard(final Slice s) {
file: solr/core/src/java/org/apache/solr/cluster/maintenance/InactiveShardRemoverConfig.java
- Suggestions:
- Validate configuration parameters (e.g., scheduleIntervalSeconds, ttlSeconds, maxDeletesPerCycle) to ensure they are within reasonable limits. This can prevent misconfigurations that could lead to aggressive shard deletions or performance issues. [important] Relevant line:In solr/core/src/java/org/apache/solr/cluster/maintenance/InactiveShardRemoverConfig.java at line 47
+ final long scheduleIntervalSeconds, final long ttlSeconds, final int maxDeletesPerCycle) {
- Allow dynamic reconfiguration of the InactiveShardRemoverConfig without needing to restart the Solr cluster. This would enable administrators to adjust settings in response to the cluster's current state. [medium] Relevant line:In solr/core/src/java/org/apache/solr/cluster/maintenance/InactiveShardRemoverConfig.java at line 28
+ public InactiveShardRemoverConfig() {
file: solr/core/src/test/org/apache/solr/cluster/maintenance/InactiveShardRemoverTest.java
- Suggestions:
- Expand test coverage to include scenarios where shards become inactive due to network partitions or other transient conditions. Ensure that the plugin does not delete these shards prematurely. [important] Relevant line:In solr/core/src/test/org/apache/solr/cluster/maintenance/InactiveShardRemoverTest.java at line 210
+ private void setAllShardsInactive(final String collectionName) {
We now have this running in production and have been able to see it working as expected, so I'm taking the PR out of draft status. @dsmiley fyi.
Test failure is reproducible on the main branch with
gradlew :solr:core:test --tests "org.apache.solr.cloud.OverseerStatusTest.test" -Ptests.jvms=96 "-Ptests.jvmargs=-XX:TieredStopAtLevel=1 -XX:+UseParallelGC -XX:ActiveProcessorCount=1 -XX:ReservedCodeCacheSize=120m" -Ptests.seed=9491A2C35A439099 -Ptests.timeoutSuite=600000! -Ptests.file.encoding=US-ASCII
I already reported org.apache.solr.cloud.OverseerStatusTest.test -- https://github.com/apache/solr/pull/2230#issuecomment-1984524318
BTW this new BitoAgent AI bot reviewer thing is wild. Not sure it's that helpful.
For all the AI fanciness of that bot, some good old fashioned static analysis is all we would have liked to point out that we are adding methods that are not used (leading to test flapper and follow-up PR). I suspect the "Muse" bot that was cancelled ~6 months ago would have caught this.