solr-operator
Support ZooKeeper `probes` parameters in the Apache Solr Operator Helm charts
Describe the issue:
When deploying SolrCloud via the Apache Solr Operator with a ZooKeeper ensemble, one of the ZooKeeper pods sometimes fails during startup with the error below:
2021-03-29 13:33:56,645 [myid:2] - ERROR [main:QuorumPeerMain@113] - Unexpected exception, exiting abnormally
java.lang.RuntimeException: My id 2 not in the peer list
at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:1073)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:227)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:136)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:90)
Possible solutions:
According to the discussion in GitHub issue 315, the suggestion is to increase `probes.readiness.initialDelaySeconds` from the default of 10 to 30 or 60 seconds.
Can we add support for the ZooKeeper `config.*` parameters in the Apache Solr Operator Helm charts?
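For reference, the requested setting would look roughly like this in a values file. This is a minimal sketch that assumes only the key path `probes.readiness.initialDelaySeconds` quoted in this issue; whether and under which key the Solr Operator chart would expose it is the open question here, so no Solr Operator key is shown.

```yaml
# Sketch only: the key path is the parameter name quoted in this issue.
probes:
  readiness:
    initialDelaySeconds: 30   # raised from the default of 10, as suggested in issue 315
```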
@iampranabroy did you find an interim solution? I am facing the same issue, and I think (for now) the only way to go is to deploy a separate ZooKeeper cluster.
I will dig into this and submit a PR. It shouldn't be that hard, I think.
Hey @mmoscher - As of now, NO. If you can raise a PR that would be great. @HoustonPutman - If there are any upcoming minor releases, can we add this item?
~However, can confirm that the described solutions, i.e. increasing the livenessProbe.initialDelaySeconds, works. Setting this to 30s I was able to successfully deploy a zookeeper cluster with replicas > 1.~
//Edit: false positive ... just had a bunch of luck. For now I'm unable to successfully (re-)deploy a zookeeper cluster. Let's move this discussion back to: https://github.com/pravega/zookeeper-operator/issues/315
@mmoscher We can definitely add probes support through the Solr Operator, but just to make sure: you solved this issue independently of any Solr/ZK settings, correct?
@HoustonPutman yes, solved it without using any probes. The problem was related to wrong NetworkPolicies and old (maybe corrupted) configs in the zookeeper PVC, cf. https://github.com/pravega/zookeeper-operator/issues/315#issuecomment-1259187314
Hey, @mmoscher - Thanks for your response.
In my case, I have the Solr cluster and the ZooKeeper cluster deployed in the same namespace, but I have seen this error several times. If we can add support for `probes.readiness.initialDelaySeconds`, we can see if that resolves the problem.
@mmoscher - Do you have your `zookeeper` and `solr` deployed in the same namespace or in different namespaces? I was curious about `allow-zookeeper-access: true`.
@iampranabroy yes, all resources (Solr + ZK) are in the same namespace, with NetworkPolicies denying all pods' egress traffic.
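For anyone hitting the same error under a deny-all egress setup, here is a minimal sketch of a NetworkPolicy that lets ZooKeeper pods reach each other. The policy name and the `app: example-zookeeper` label are hypothetical and not taken from this thread; the ports are ZooKeeper's default client, quorum, and leader-election ports, so adjust them if your ensemble is configured differently.

```yaml
# Sketch only: name and labels are hypothetical placeholders.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-zk-egress-example
spec:
  # Applies to the ZooKeeper pods themselves.
  podSelector:
    matchLabels:
      app: example-zookeeper
  policyTypes:
    - Egress
  egress:
    # Allow traffic to the other ensemble members in the same namespace.
    - to:
        - podSelector:
            matchLabels:
              app: example-zookeeper
      ports:
        - protocol: TCP
          port: 2181   # client connections
        - protocol: TCP
          port: 2888   # follower-to-leader quorum traffic
        - protocol: TCP
          port: 3888   # leader election
```

Pods that resolve peers by hostname usually also need an egress rule for DNS, and the Solr pods need their own rule allowing egress to port 2181.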