solr icon indicating copy to clipboard operation
solr copied to clipboard

SOLR-16116: Use apache curator to manage the Solr Zookeeper interactions

Open HoustonPutman opened this issue 2 years ago • 23 comments

https://issues.apache.org/jira/browse/SOLR-16116

This is still a WIP there are two outstanding issues:

  • [x] The hadoop-auth framework requires a different version of curator
  • [x] The leader election code is slightly broken, because the OnReconnect logic in ZKController expects to be called at different times, with different guarantees of watchers and ephemeral nodes than Zookeeper gives.

Might look into using curator recipes for persistent watchers and leader election, but this is a fairly complete migration to start with.

HoustonPutman avatar Mar 24 '22 17:03 HoustonPutman

leader election code is not broken, it was that the sessionTimeout was not being populated in SolrZkClient.

HoustonPutman avatar Mar 24 '22 21:03 HoustonPutman

@HoustonPutman I haven't looked super closely at this, but I think that Hadoop shades curator when needed. Its possible that the Solr Hadoop classes just need to have the imports updated to use the Hadoop specific shaded curator paths. This is new with the Hadoop auth and HDFS modules in Solr when we upgraded Hadoop. I tried to switch to the Hadoop self contained shaded dependencies to avoid issues like this.

risdenk avatar Apr 12 '22 13:04 risdenk

@risdenk I believe the issue currently is that for hadoop-auth we are not able to rely solely on hadoop-client-* dependencies, which are the ones that would use a relocated curator. All of the auth filter and kerberos logic and whatever else we borrow from hadoop is annotated with audience LimitedPrivate I think, so they didn't bother making a consumable client module for people to use.

madrob avatar Apr 12 '22 14:04 madrob

Thanks for the comment @risdenk . I was talking to @madrob about it, but he hinted that there were other problems with using the shaded dependencies. However that might no longer be applicable, so I will definitely give it a shot in a week or month (things getting a bit busy haha).

HoustonPutman avatar Apr 12 '22 14:04 HoustonPutman

I propose we make the loss of the reconnect boolean its own PR, plus the methods that shielded use of ZkCmdExecutor. That will make this effort here more independently reviewable and would be nice in its own right.

dsmiley avatar Feb 19 '23 00:02 dsmiley

hadoop 3.3.6 upgraded curator to 5.2.0 - https://github.com/apache/solr/pull/1743. I'm working on getting that in. chatted w/ @HoustonPutman about it as well

risdenk avatar Oct 03 '23 19:10 risdenk

https://github.com/apache/solr/pull/1743 is in main and so this PR should be ready to update :D

risdenk avatar Oct 04 '23 23:10 risdenk

I merged main (after hadoop 3.3.6/curator 5.x) and then fixed up different things after that to get all the tests passing including nightly. Would appreciate another set of eyes @HoustonPutman @dsmiley

I'll look through this again to see if things can be cleaned up more, but hopefully its progress.

risdenk avatar Oct 10 '23 13:10 risdenk

How come the bats tests didn't run? I notice that only precommit, solr and solrJ tests run...

epugh avatar Oct 10 '23 19:10 epugh

I propose we make the loss of the reconnect boolean its own PR, plus the methods that shielded use of ZkCmdExecutor. That will make this effort here more independently reviewable and would be nice in its own right.

+1 to David's suggestion; this diff is pretty unwieldy. It's not too late to do this either -- we could just add 2 commits: one removing the strictly mechanical changes and then another commit immediately reverting the removal. Simply having the commits would allow creative diff'ing and patch-applying to make the PR easier to review. I'll probably do this locally anyway just to facilitate my own reviewing.

magibney avatar Oct 11 '23 15:10 magibney

I can take care of that hopefully today.

risdenk avatar Oct 11 '23 15:10 risdenk

@magibney @dsmiley I pushed the revert of retryOnConnLoss - basically its just ignored now in 44399c7

I have a local branch that I'm going to remove retryOnConnLoss (deprecated on branch_9x) but remove on main.

edit: pushed that removal commit here - https://github.com/apache/solr/pull/2004 just so it doesn't get lost.

risdenk avatar Oct 11 '23 19:10 risdenk

This PR had no visible activity in the past 60 days, labeling it as stale. Any new activity will remove the stale label. To attract more reviewers, please tag someone or notify the [email protected] mailing list. Thank you for your contribution!

github-actions[bot] avatar Mar 10 '24 00:03 github-actions[bot]

It would be a shame if this never makes it in.... Should hadoop-auth be removed in 10x? Or moved to it's own github project in 10x?

epugh avatar Mar 10 '24 18:03 epugh

AFAIK, ever since #1743 landed, we're no longer blocked on Hadoop-Auth

dsmiley avatar Mar 11 '24 12:03 dsmiley

Yeah let's get this in main. Probably best to not introduce it in a late 9x version

HoustonPutman avatar Mar 11 '24 17:03 HoustonPutman

@HoustonPutman @dsmiley I'm happy to merge main into this again and resolve any conflicts. can you do another review to make sure all is good before merging to main?

risdenk avatar Mar 11 '24 19:03 risdenk

This is an important initiative. If you've given up on it @risdenk, let us know so someone can take over.

dsmiley avatar Apr 25 '24 18:04 dsmiley

This is an important initiative. If you've given up on it @risdenk, let us know so someone can take over.

I'm not actively working on it. @HoustonPutman started this and I just tried to push it a bit further. I might get back to it but nothing immediately. Someone else can definitely push it further.

risdenk avatar Apr 30 '24 13:04 risdenk

Thanks for putting in the work @risdenk ! I will push to add this for 10x only this summer.

HoustonPutman avatar Apr 30 '24 14:04 HoustonPutman