Canonicalize on 1 or multiple meshes per cluster

steeling opened this issue 2 years ago · 12 comments

Right now the code is in a sort of mid-way state between allowing multiple meshes in a cluster and having one mesh per cluster.

Do we have a stance on what we want to do going forward? For instance, some feature requests may be simpler to implement under the single-mesh assumption.

FWIW, somebody could more or less achieve the benefits of multiple meshes through certificate manipulation (if we were to add that feature) while retaining the concept of a single mesh.

steeling avatar Apr 20 '22 17:04 steeling

That's a good question, and agreed that we're somewhere in between implementations for single vs. multiple mesh. I think the driving question we need to answer here (which I don't have the answer to right now) is whether we want OSM to focus on single-tenant or multi-tenant usage of Kubernetes. The argument for supporting multiple meshes in a single cluster would be if there were multiple teams (or even just different software) that shouldn't share the same service mesh.

I'm not sure how valid that requirement is, but I would be curious to hear others' opinions on this.

trstringer avatar Apr 20 '22 18:04 trstringer

In a practical sense, multi-mesh is essentially just multi-trust-domain, right? I have a hard time envisioning an organization running multiple trust domains in a single cluster. An empty TrafficTarget already denies traffic, right? Is there a scenario where that's not sufficient?
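To make that point concrete, here's a minimal SMI sketch (all names here are hypothetical): under OSM's deny-by-default permissive-mode-off model, traffic only flows where a TrafficTarget explicitly allows it, so tenant boundaries can be drawn with policy rather than with a second mesh.

```yaml
# Hypothetical example: only "frontend" in frontend-ns may reach "backend"
# in backend-ns; a workload not named as a source in any TrafficTarget is denied.
apiVersion: specs.smi-spec.io/v1alpha4
kind: HTTPRouteGroup
metadata:
  name: backend-routes
  namespace: backend-ns
spec:
  matches:
  - name: all-routes
    pathRegex: ".*"
    methods: ["*"]
---
apiVersion: access.smi-spec.io/v1alpha3
kind: TrafficTarget
metadata:
  name: frontend-to-backend
  namespace: backend-ns
spec:
  destination:
    kind: ServiceAccount
    name: backend
    namespace: backend-ns
  sources:
  - kind: ServiceAccount
    name: frontend
    namespace: frontend-ns
  rules:
  - kind: HTTPRouteGroup
    name: backend-routes
    matches:
    - all-routes
```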

keithmattix avatar Apr 20 '22 19:04 keithmattix

Here's some context from the time we introduced this feature: The initial idea to support multiple meshes was to address multi-tenancy within the same cluster and provide isolation primitives in terms of policies, for both the user and control plane. This was done using the openservicemesh.io/monitored-by: <mesh-name> label. An immediate use case back then was to allow multiple CI runs from different pull requests running on the same underlying k8s cluster to not interfere with each other. This worked well for a long time. Then came the need to support upgrades, which involved global resources such as CRDs, webhook configs, etc., and designing this for multiple meshes was not an immediate priority for the project.
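Concretely, enrollment looked like this (the namespace and mesh names below are made up for illustration; the label key is the one mentioned above):

```yaml
# Hypothetical example: this namespace belongs to the mesh instance "mesh-a",
# so only the control plane installed with that mesh name reconciles it.
apiVersion: v1
kind: Namespace
metadata:
  name: ci-run-a
  labels:
    openservicemesh.io/monitored-by: mesh-a
```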

So the idea behind multiple meshes was to provide logical isolation for the control plane and user policies applicable to a mesh instance. I am unsure if customers have such a requirement, but our use case back then was to allow running multiple OSM instances on the same cluster. If we do not see a need for such a scenario anymore, supporting a single mesh will simplify some of the existing components.

shashankram avatar Apr 20 '22 19:04 shashankram

Multiple CI runs is an interesting case, although I'd imagine that applies more to us than to customers? If so, it seems like what we have with KIND is a working solution.

My vote would be to remove the ability to have multiple meshes. If we ever do find a strong customer need, we can re-introduce the feature by allowing multiple cert chains (one per tenant) and mapping those to each logical mesh, which would solve the multiple global resource issue.

In the end I'm not too opinionated about what we choose here, but I do see value in coming to a decision.

steeling avatar Apr 20 '22 21:04 steeling

> Multiple CI runs is an interesting case, although I'd imagine that applies more to us than to customers? If so, it seems like what we have with KIND is a working solution.
>
> My vote would be to remove the ability to have multiple meshes. If we ever do find a strong customer need, we can re-introduce the feature by allowing multiple cert chains (one per tenant) and mapping those to each logical mesh, which would solve the multiple global resource issue.
>
> In the end I'm not too opinionated about what we choose here, but I do see value in coming to a decision.

I can imagine a similar scenario for customers, though it is uncommon in practice.

Multiple meshes within a cluster are not just about certificates; there is a lot more to it. The multi-mesh feature allows multiple control plane instances to co-exist, each managing a logical mesh instance without interfering with the others, i.e., a policy applied in one mesh has no bearing on other meshes.

Currently, the multi-mesh feature only lacks upgrade support, as upgrading results in a global state change on the cluster (CRDs, conversion webhooks, etc.) that would affect other meshes.

shashankram avatar Apr 21 '22 16:04 shashankram

Just to throw another wrinkle into this, multi-mesh is a superset of a separate, more common problem: canary upgrades between control plane mesh versions. During most implementations of this process, two instances of the control plane run at the same time, but each instance handles a subset of resources. Before coming to a decision on the larger question of multi-mesh support, it's probably worth shoring up our user stories to better understand what problems our users will be trying to solve.
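As a rough sketch of that overlap (mesh and namespace names are hypothetical), a canary control plane upgrade looks like two meshes with namespaces migrated one at a time by relabeling:

```yaml
# Hypothetical example: "payments" still points at the old control plane,
# while "checkout" has been migrated to the canary; flipping the label moves
# a namespace between control plane instances.
apiVersion: v1
kind: Namespace
metadata:
  name: payments
  labels:
    openservicemesh.io/monitored-by: osm-v1
---
apiVersion: v1
kind: Namespace
metadata:
  name: checkout
  labels:
    openservicemesh.io/monitored-by: osm-v2
```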

keithmattix avatar Apr 21 '22 16:04 keithmattix

These are all really good points. From my perspective, we have three options here:

  1. Keep the underlying multi-mesh code as-is and remove the user experience from the CLI so that it isn't confusing.
  2. Remove all the multi-mesh code from the product.
  3. Complete the multi-mesh feature so that it is usable and reliable.

My personal vote would be for the first option. Without heavy user demand I'm not sure we want to commit to this work in the short term. But likewise, I don't want to rip out all the code and rely on Git history to piece it back together if/when we decide to re-add this feature in the future.

But agreed, @steeling, the current status of this is not ideal and we should do something to put it in a more consistent state that isn't confusing or misleading.

Curious to others' opinions on this and how you all would like to see this transition.

trstringer avatar Apr 26 '22 13:04 trstringer

> These are all really good points. From my perspective, we have three options here:
>
>   1. Keep the underlying multi-mesh code as-is and remove the user experience from the CLI so that it isn't confusing.
>   2. Remove all the multi-mesh code from the product.
>   3. Complete the multi-mesh feature so that it is usable and reliable.
>
> My personal vote would be for the first option. Without heavy user demand I'm not sure we want to commit to this work in the short term. But likewise, I don't want to rip out all the code and rely on Git history to piece it back together if/when we decide to re-add this feature in the future.
>
> But agreed, @steeling, the current status of this is not ideal and we should do something to put it in a more consistent state that isn't confusing or misleading.
>
> Curious to others' opinions on this and how you all would like to see this transition.

I agree with everything mentioned. I think the one question that remains is: when adding a new feature, should we maintain the ability to leverage a second mesh? If the answer is no, then over a long enough period of time we would likely end up with a Frankenstein feature that would need to be rewritten from scratch.

That said, I think it's a good approach in the interim, and I'd add my vote that we go with option #1 that @trstringer mentioned. Side note: this would have immediate implications for #4613.

steeling avatar Apr 27 '22 19:04 steeling

This issue will be closed due to a long period of inactivity. If you would like this issue to remain open then please comment or update.

github-actions[bot] avatar Jun 27 '22 00:06 github-actions[bot]

Issue closed due to inactivity.

github-actions[bot] avatar Jul 04 '22 00:07 github-actions[bot]

Added default label size/needed. Please consider re-labeling this issue appropriately.

github-actions[bot] avatar Jul 13 '22 00:07 github-actions[bot]

Hi @shashankram

Hope you are doing well. Do we have any plans to support multiple meshes in the future? Thanks.

tfbubu111 avatar Jul 14 '22 03:07 tfbubu111

This issue will be closed due to a long period of inactivity. If you would like this issue to remain open then please comment or update.

github-actions[bot] avatar Jan 22 '23 00:01 github-actions[bot]

Issue closed due to inactivity.

github-actions[bot] avatar Jan 29 '23 00:01 github-actions[bot]