Support Zookeeper `probes` parameters in Apache Solr Operator helm charts.

Open iampranabroy opened this issue 2 years ago • 3 comments

Describe the issue:

When deploying SolrCloud via the Apache Solr Operator with a Zookeeper ensemble, one of the zookeeper pods sometimes fails during startup with the error below:

2021-03-29 13:33:56,645 [myid:2] - ERROR [main:QuorumPeerMain@113] - Unexpected exception, exiting abnormally
java.lang.RuntimeException: My id 2 not in the peer list
	at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:1073)
	at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:227)
	at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:136)
	at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:90)
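
For readers unfamiliar with this error: ZooKeeper raises it at startup when the server's own id (read from its `myid` file) has no matching `server.N` entry in the quorum membership it loaded. The following is only a minimal Python sketch of that check, not ZooKeeper's actual code; the helper names and the hostname in the sample config are hypothetical and chosen for illustration.

```python
import re

# Matches zoo.cfg-style membership lines such as "server.1=host:2888:3888".
SERVER_LINE = re.compile(r"^server\.(\d+)=(.+)$")


def parse_peer_ids(config_text):
    """Return {id: address} for every `server.N=...` line in the config text."""
    peers = {}
    for raw in config_text.splitlines():
        match = SERVER_LINE.match(raw.strip())
        if match:
            peers[int(match.group(1))] = match.group(2)
    return peers


def check_my_id(my_id, config_text):
    """Mimic the startup check: fail if our own id is missing from the peer list."""
    peers = parse_peer_ids(config_text)
    if my_id not in peers:
        raise RuntimeError(f"My id {my_id} not in the peer list")
    return peers[my_id]


# Hypothetical membership that only knows about server 1.
# The hostname is made up for this example.
STALE_CONFIG = """\
server.1=zk-0.zk-headless:2888:3888:participant;0.0.0.0:2181
"""

if __name__ == "__main__":
    print(check_my_id(1, STALE_CONFIG))  # server 1 finds itself in the list
    try:
        check_my_id(2, STALE_CONFIG)     # server 2 is not listed
    except RuntimeError as err:
        print(err)
```

In this sketch, a pod with id 2 that starts against a membership list containing only server 1 fails in the same way as the log above.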

Possible solutions:

According to the discussion in GitHub issue-315, the suggested fix is to increase probes.readiness.initialDelaySeconds from the default of 10 seconds to 30 or 60 seconds. Can we add support for the zookeeper config.* parameters in the Apache Solr Operator helm charts?

iampranabroy avatar Sep 21 '22 05:09 iampranabroy

@iampranabroy did you find an interim solution? I'm facing the same issue, and I think (for now) the only way to go is to deploy a separate zookeeper cluster.

I will dig into this and submit a PR. It shouldn't be that hard, I think.

mmoscher avatar Sep 26 '22 14:09 mmoscher

Hey @mmoscher - As of now, NO. If you can raise a PR that would be great. @HoustonPutman - If there are any upcoming minor releases, can we add this item?

iampranabroy avatar Sep 26 '22 14:09 iampranabroy

~However, I can confirm that the described solution, i.e. increasing the livenessProbe.initialDelaySeconds, works. After setting this to 30s I was able to successfully deploy a zookeeper cluster with replicas > 1.~

//Edit: false positive ... just had a bunch of luck. For now I'm unable to successfully (re-)deploy a zookeeper cluster. Let's move this discussion back to: https://github.com/pravega/zookeeper-operator/issues/315

mmoscher avatar Sep 26 '22 14:09 mmoscher

@mmoscher We can definitely add probes support through the Solr Operator, but just to make sure: you solved this issue independently of any Solr/ZK settings, correct?

HoustonPutman avatar Oct 21 '22 16:10 HoustonPutman

@HoustonPutman yes, solved it without using any probes. The problem was related to wrong NetworkPolicies and old (maybe corrupted) configs in the zookeeper PVC, cf. https://github.com/pravega/zookeeper-operator/issues/315#issuecomment-1259187314

mmoscher avatar Oct 25 '22 11:10 mmoscher

Hey, @mmoscher - Thanks for your response. In my case, I have the Solr cluster and zookeeper cluster deployed in the same namespace, but I have seen this error several times. If we can add support for probes.readiness.initialDelaySeconds, we can see whether that resolves the problem.

@mmoscher - Do you have your zookeeper and solr deployed in the same namespace or in different namespaces? I was curious about allow-zookeeper-access: true

iampranabroy avatar Oct 25 '22 14:10 iampranabroy

@iampranabroy yes, all resources (Solr + ZK) are in the same namespace, with NetworkPolicies denying all pods' egress traffic.

mmoscher avatar Oct 25 '22 17:10 mmoscher