solr
solr copied to clipboard
SOLR-16116: Use apache curator to manage the Solr Zookeeper interactions
https://issues.apache.org/jira/browse/SOLR-16116
This is still a WIP there are two outstanding issues:
- [x] The hadoop-auth framework requires a different version of curator
- [x] The leader election code is slightly broken, because the OnReconnect logic in ZKController expects to be called at different times, with different guarantees of watchers and ephemeral nodes than Zookeeper gives.
Might look into using curator recipes for persistent watchers and leader election, but this is a fairly complete migration to start with.
leader election code is not broken, it was that the sessionTimeout was not being populated in SolrZkClient.
@HoustonPutman I haven't looked super closely at this, but I think that Hadoop shades curator when needed. Its possible that the Solr Hadoop classes just need to have the imports updated to use the Hadoop specific shaded curator paths. This is new with the Hadoop auth and HDFS modules in Solr when we upgraded Hadoop. I tried to switch to the Hadoop self contained shaded dependencies to avoid issues like this.
@risdenk I believe the issue currently is that for hadoop-auth we are not able to rely solely on hadoop-client-*
dependencies, which are the ones that would use a relocated curator. All of the auth filter and kerberos logic and whatever else we borrow from hadoop is annotated with audience LimitedPrivate
I think, so they didn't bother making a consumable client module for people to use.
Thanks for the comment @risdenk . I was talking to @madrob about it, but he hinted that there were other problems with using the shaded dependencies. However that might no longer be applicable, so I will definitely give it a shot in a week or month (things getting a bit busy haha).
I propose we make the loss of the reconnect boolean its own PR, plus the methods that shielded use of ZkCmdExecutor. That will make this effort here more independently reviewable and would be nice in its own right.
hadoop 3.3.6 upgraded curator to 5.2.0 - https://github.com/apache/solr/pull/1743. I'm working on getting that in. chatted w/ @HoustonPutman about it as well
https://github.com/apache/solr/pull/1743 is in main and so this PR should be ready to update :D
I merged main (after hadoop 3.3.6/curator 5.x) and then fixed up different things after that to get all the tests passing including nightly. Would appreciate another set of eyes @HoustonPutman @dsmiley
I'll look through this again to see if things can be cleaned up more, but hopefully its progress.
How come the bats tests didn't run? I notice that only precommit, solr and solrJ tests run...
I propose we make the loss of the reconnect boolean its own PR, plus the methods that shielded use of ZkCmdExecutor. That will make this effort here more independently reviewable and would be nice in its own right.
+1 to David's suggestion; this diff is pretty unwieldy. It's not too late to do this either -- we could just add 2 commits: one removing the strictly mechanical changes and then another commit immediately reverting the removal. Simply having the commits would allow creative diff'ing and patch-applying to make the PR easier to review. I'll probably do this locally anyway just to facilitate my own reviewing.
I can take care of that hopefully today.
@magibney @dsmiley I pushed the revert of retryOnConnLoss - basically its just ignored now in 44399c7
I have a local branch that I'm going to remove retryOnConnLoss
(deprecated on branch_9x) but remove on main.
edit: pushed that removal commit here - https://github.com/apache/solr/pull/2004 just so it doesn't get lost.
This PR had no visible activity in the past 60 days, labeling it as stale. Any new activity will remove the stale label. To attract more reviewers, please tag someone or notify the [email protected] mailing list. Thank you for your contribution!
It would be a shame if this never makes it in.... Should hadoop-auth be removed in 10x? Or moved to it's own github project in 10x?
AFAIK, ever since #1743 landed, we're no longer blocked on Hadoop-Auth
Yeah let's get this in main. Probably best to not introduce it in a late 9x version
@HoustonPutman @dsmiley I'm happy to merge main into this again and resolve any conflicts. can you do another review to make sure all is good before merging to main?
This is an important initiative. If you've given up on it @risdenk, let us know so someone can take over.
This is an important initiative. If you've given up on it @risdenk, let us know so someone can take over.
I'm not actively working on it. @HoustonPutman started this and I just tried to push it a bit further. I might get back to it but nothing immediately. Someone else can definitely push it further.
Thanks for putting in the work @risdenk ! I will push to add this for 10x only this summer.