solr-operator
How to prevent node rotation behavior from causing cluster instability
We are just starting out with the Solr Operator, and intend to move several large Solr clusters over to it for their management. In our initial tests, we've encountered a situation that seems incredibly risky, and we would like to understand whether there is a reasonable solution already in place, or good suggestions for how to improve reliability around it.
The logic around `SolrCloud.Spec.updateStrategy` being `Managed` (https://apache.github.io/solr-operator/docs/solr-cloud/solr-cloud-crd.html#update-strategy) means that the operator will never take an action that risks cluster stability (shutting down a pod that would result in no live replicas, etc.). This is fantastic, but it only covers actions that the operator itself takes (StatefulSet updates, etc.), and doesn't appear to come into play during normal Kubernetes operations, such as node rotations.
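For context, this is roughly what the managed update strategy looks like in a SolrCloud spec (values are purely illustrative):

```yaml
apiVersion: solr.apache.org/v1beta1
kind: SolrCloud
metadata:
  name: example
spec:
  replicas: 6
  updateStrategy:
    method: Managed          # the operator orchestrates rolling updates itself
    managed:
      maxPodsUnavailable: 2           # never take down more than 2 pods at once
      maxShardReplicasUnavailable: 1  # never leave a shard with more than 1 replica down
```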
On an EKS cluster, when a node group is refreshed, the nodes are marked for termination within their autoscaling groups; their pods are then drained from the nodes being shut down and re-scheduled onto valid nodes. The normal Kubernetes mechanism for preventing service disruptions during this type of event is a Pod Disruption Budget, which stops a draining node from evicting its pods if doing so would cause a disruption. This leverages readiness/liveness status to determine when a disruption would occur, and is generally a reliable way of preventing applications from becoming unavailable.
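As a reference point, a typical PDB guarding an ordinary app during drains looks something like this (the name and label are hypothetical):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  maxUnavailable: 1      # drains are blocked once one pod is already unavailable
  selector:
    matchLabels:
      app: my-app        # hypothetical app label
```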
With Solr, there is another level of abstraction: a Solr pod being "ready" doesn't mean that all of the cores on that node are available/replicated. A pod disruption budget, which only monitors that readiness state, may therefore conclude that it is safe to delete an arbitrary pod in the cluster, without the logic (which the Operator has) of checking whether shutting down that pod would actually cause a disruption.
With a large cluster, nodes/pods that go up and down may take time to recover, and without a PDB you risk multiple pods going down simultaneously. So we perceive a risk that Solr's availability could suffer should a node rotation, or any other form of pod deletion, occur outside the Operator's purview.
So, my question is: What methodology is recommended for eliminating this risk? Are there configurations we've overlooked that will reduce this risk? Has the community simply accepted this limitation and found ways to reduce the odds of being impacted? (are we maybe overreacting, and this isn't actually a risk?)
This is a very good callout, so thank you for bringing it up.
We can easily add a PodDisruptionBudget for the entire SolrCloud cluster, and the `maxUnavailable` can be populated with the `SolrCloud.spec.updateStrategy.managed.maxPodsUnavailable` value. This is a pretty good first step and gets us halfway there.
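A sketch of what that operator-managed, cluster-wide PDB might look like (the selector assumes operator-managed pod labels such as `solr-cloud: <name>`; exact labels and values are illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-solrcloud-pdb
spec:
  # mirrored from SolrCloud.spec.updateStrategy.managed.maxPodsUnavailable
  maxUnavailable: 2
  selector:
    matchLabels:
      solr-cloud: example      # assumed operator-managed pod labels
      technology: solr-cloud
```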
The next half would be replicating the `SolrCloud.spec.updateStrategy.managed.maxShardReplicasUnavailable` functionality through PDBs. Through the managed update code, we already understand the nodes that each shard resides on, so it wouldn't be far-fetched to create a PDB for every shard, using a custom labelSelector to pick out the node-name labels of nodes that we already know host that shard. We could even just routinely check (every minute or so) to update/create/delete PDBs, since we aren't listening to the cluster state in the cloud. The PodDisruptionBudget documentation tells us that we can't use `maxUnavailable`, as PDBs with custom labelSelectors can only use an int-valued `minAvailable`. That's fine, because we can always convert between the two, since we know the number of nodes that host the shard.
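A hypothetical per-shard PDB along those lines, selecting only the pods known to host that shard (pod names, counts, and the choice of the StatefulSet pod-name label are all illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-collection1-shard1-pdb
spec:
  # replicas hosting the shard (3) minus maxShardReplicasUnavailable (1)
  minAvailable: 2
  selector:
    matchExpressions:
      - key: statefulset.kubernetes.io/pod-name   # one way to target specific pods
        operator: In
        values:
          - example-solrcloud-0
          - example-solrcloud-3
          - example-solrcloud-5
```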
However, there's another rule for PDBs that makes this part of the solution untenable. It specifies that you can only have one PodDisruptionBudget per pod, and for this solution we would need a PDB for every shard that lives on a pod, which will almost certainly be more than one. (Otherwise the general cluster PDB should be fine to use.)
Hopefully Kubernetes will eventually remove the one-PDB-per-pod limit; then we can fully (and without too much difficulty) implement shard-level PDBs managed by the Solr Operator. In the meantime, we should go ahead and implement the per-cluster PodDisruptionBudget and fill it with the value used in the managed update settings.
Given the limitations, what are your thoughts on moving forward with the cluster-level PDB @joshsouza ?
I 100% support adding a cluster-level PDB here, as that's definitely a first step towards success. My concern is that the PDB will ensure we don't take a pod down if one is already down, but consider the scenario where a pod has just started and is coming online, doing a recovery of some large dataset that takes longer than the readinessProbe accounts for: from k8s' perspective it's safe to take down another pod, but from Solr's perspective it may be a risky operation.
A cluster-level PDB will at least reduce the level of risk here, but to the point of your (very thorough, thank you) note above, it's a step on the path to a final solution.
Ideas our team has been tossing around in discussions:
- A `startupProbe` may also reduce risk (though it still allows for some edge cases); see the sketch after this list. If a newly starting pod had a startup probe that didn't go green until all of the shards assigned to that pod were active/recovered, that could prevent what I described above. However, there's a secondary risk of things getting stuck (i.e. what if there's an inactive shard on that pod?).
- Sidecar readiness - I need to check up on the docs/test myself, but I'm curious whether the entire pod needs to pass all of its readiness checks in order to be active in the service, or whether we can leverage a sidecar whose readiness check just verifies that all shards on that pod are in an active/ok state. If that works (i.e. the sidecar flapping doesn't impact the pod's availability in the real Solr service), then a PDB would enforce the desired behavior: it would never allow a pod to be taken out of commission while another pod in the cluster is not ready. A side effect is that this could be detrimental in situations where it's ok to take down some pods while others are recovering, slowing rotations, etc., so it's still just a half-measure compared to your suggestion of a real check that uses Solr logic to indicate which pods are acceptable to disrupt, via a PDB that ties together the pods that own a shard. I'm just not sure that's on a realistic horizon from the k8s timeline perspective.
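Here is a rough sketch of the `startupProbe` idea from the list above; the check script is hypothetical (Solr doesn't ship one), and the thresholds are just placeholders:

```yaml
startupProbe:
  exec:
    command:
      - /bin/sh
      - -c
      # hypothetical script: exits 0 only once every core hosted by this pod reports active/recovered
      - /opt/scripts/all-local-cores-active.sh
  periodSeconds: 10
  failureThreshold: 180   # allow up to ~30 minutes for large recoveries before giving up
```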
Sorry to brain dump, just thought I'd add what's on my mind to the conversation in case it's helpful.
(Having read up more carefully on the docs: we can't use the sidecar idea, because a failing sidecar readiness check would mark the whole pod not ready and drop it from the service.)
Just had a thought on this after perusing the docs further to see if there's anything I could find to support our end goals within current constraints: https://kubernetes.io/docs/tasks/run-application/configure-pdb/#specifying-a-poddisruptionbudget
I can specify a disruption `maxUnavailable` of `0`, which will prevent any voluntary disruptions entirely.
So the operator could manage the PDB and set `maxUnavailable` to, for example, `1` as long as every shard is happy; when it detects that shards are in a recovering state, where an additional pod going down risks reliability, it could adjust the PDB to set `maxUnavailable` to `0` until that condition passes. That would prevent additional eviction behavior until it's safe.
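A sketch of that PDB in its locked-down state (names and labels illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-solrcloud-pdb
spec:
  # normally 1; flipped to 0 by the operator while any shard is recovering
  maxUnavailable: 0
  selector:
    matchLabels:
      solr-cloud: example      # assumed operator-managed pod labels
      technology: solr-cloud
```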
I think this is a potentially viable solution until the platform supports multiple PDBs on a pod.
What do you think?
It also occurred to me that if each SolrCloud had a PDB with a `maxUnavailable` of `0` at all times, the Solr Operator could monitor the cluster for node rotation behavior (node drain events, etc.) and take the appropriate action itself (I believe pods can still be deleted, or otherwise shut down). I don't know exactly what would need to be monitored, nor how to shut down Solr pods while a PDB would normally prevent such actions, but it may be a thought process worth pursuing at some point, since the Operator already has the logic baked in to know when a pod is safe to delete/disrupt.
> Just had a thought on this after perusing the docs further to see if there's anything I could find to support our end goals within current constraints: https://kubernetes.io/docs/tasks/run-application/configure-pdb/#specifying-a-poddisruptionbudget I can specify a disruption maxUnavailable of 0. This will prevent any voluntary disruptions entirely.
That is very interesting, and could certainly be something for us to look into.
If we go further down that idea, we could have a PDB for each pod individually, and basically set the `minAvailable` to either 0 or 1 depending on whether it's ok to take down that pod at any given time (given the same logic we use for restarts). That gives us a much more fine-tuned ability to control this.
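For illustration, a per-pod PDB like that could select a single pod by its StatefulSet pod-name label (the pod name here is made up):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-solrcloud-2-pdb
spec:
  # 1 = this pod must stay up (eviction blocked); 0 = safe to evict right now
  minAvailable: 1
  selector:
    matchLabels:
      statefulset.kubernetes.io/pod-name: example-solrcloud-2
```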
> It also occurred to me that if each SolrCloud had a PDB with a maxUnavailable of 0 at all times, the Solr Operator could monitor the cluster for node rotation behavior
This is probably the best solution, if we can get it right. There are things the Solr Operator generally wants to control before letting a pod get deleted, such as moving replicas off of a Solr node with ephemeral data. So if we are able to do that, then I think we go for it.
The new DisruptionCondition stuff (https://kubernetes.io/docs/concepts/workloads/pods/disruptions/#pod-disruption-conditions) might give us that info, but it's alpha in v1.25, so it probably won't be available by default for at least a few more versions. I'm also not sure if it will put the condition on the pod if the PDB says not to delete it... But it would certainly be the easiest way forward if we wanted to do this.
Either way, we don't need to be perfect from the beginning. I say that for now, we either go cluster-wide PDB or do per-pod PDBs. But I absolutely love this discussion, and with a few new versions of Kubernetes, we can probably get this to an amazing place.
Thanks for all the thoughtful discussion. It hadn't even occurred to me to do a per-pod PDB, but that makes a ton of sense given the context, and I would say that's probably the most viable near-term solution (since there's so much up in the air for future k8s revisions, and we wouldn't want to require bleeding-edge k8s to run Solr safely).
That said, I think it's worth taking the time to do this right, get other voices, and test things out. In the interim, my team is proceeding with a cluster-wide PDB, plus a pod that flips its allowed disruptions between 0 and 1, in order to be overly cautious.
I think that's a reasonable option as a stop-gap for us, but I'd love to help where I can in making this a first-party solution.
How can I best help out?
@joshsouza Please let me know if you try the new version and if it helps resolve the problem. We kind of have a similar scenario. Will give it a try ourselves soon and will share our observations as well.
Thanks, @HoustonPutman, for adding it to the new release, `0.7.0`.
Also looking forward to some of the features suggested above...
This probably won't be the route for the operator, but posting an alternative idea here for others. Our centralized K8s management currently requires that our PDBs always allow at least 1 disruption, so we can't necessarily go with flipping a cluster-level PDB between 0 and 1.
We were thinking of making a modified version of `/admin/collections?action=CLUSTERSTATUS` that essentially throws a non-200 when the status is non-GREEN. Then we would set the readinessProbe to this endpoint. This would involve writing Solr cluster-level plugins, which might not be ideal for those otherwise using vanilla Solr.
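For anyone exploring this route, the probe side might look roughly like the following; the handler path is hypothetical and would be whatever the custom plugin exposes:

```yaml
readinessProbe:
  httpGet:
    # hypothetical handler from the custom plugin; returns non-200 when cluster health is not GREEN
    path: /solr/admin/plugins/clusterhealth
    port: 8983
  periodSeconds: 15
  failureThreshold: 3
```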