
Support stretching vcluster to multiple host clusters

olljanat opened this issue 2 years ago · 7 comments

Is your feature request related to a problem?

From a fault tolerance and disaster recovery point of view it would be better to have three different host clusters in different datacenters than one stretched cluster. Then, if there is a network/power/etc. failure in one datacenter, or if a host cluster upgrade goes seriously wrong, the others would continue running.

However, if each of those host clusters ran a separate vcluster instance, that would have to be handled in the application CI/CD pipelines, and whoever needs to troubleshoot the environment would have to connect to all of those vclusters.

Related to #193, as this would make it possible to move workloads between host clusters online.

Which solution do you suggest?

For now I would like to understand whether it is even theoretically possible to stretch a vcluster across multiple host clusters. How big would the changes to vcluster need to be? What would the cons of that kind of solution be?

As far as I understand, etcd should work fine as long as suitable values are provided for these: https://github.com/loft-sh/vcluster/blob/3241bc77e50f064afd07bf0ee004f12c618fe4d2/charts/k8s/templates/etcd-statefulset.yaml#L88-L90 Most probably the best way would be to give --initial-cluster-state=new to one etcd instance and --initial-cluster-state=existing to the others.
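For illustration, the per-member flags in such a stretched setup could look roughly like this (the member names and peer URLs are made up, and the peer endpoints would of course have to be reachable across the host clusters):

```yaml
# Hypothetical args for the etcd member running on host cluster A.
# Members on clusters B and C would list the same --initial-cluster
# but use --initial-cluster-state=existing when they join later.
- --name=etcd-a-0
- --initial-cluster-token=vcluster-etcd
- --initial-cluster=etcd-a-0=https://etcd-a-0.dc-a.example.com:2380,etcd-b-0=https://etcd-b-0.dc-b.example.com:2380,etcd-c-0=https://etcd-c-0.dc-c.example.com:2380
- --initial-cluster-state=new
```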

Also, the service CIDR would most probably need to be the same on all host clusters.

Based on https://github.com/loft-sh/vcluster/blob/3241bc77e50f064afd07bf0ee004f12c618fe4d2/charts/k8s/templates/syncer-deployment.yaml#L101-L102 only one syncer process actually works as leader, so most probably pods would only get scheduled to the host cluster where the leader is active.

So, as far as I understand, what is missing is that the syncer should at least know the API server endpoints of all the host clusters, nodes from every host cluster should be synced into the vcluster, and scheduling should be synced based on which node a pod gets "allocated" to inside the vcluster.
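Purely as a sketch of the information the syncer would need (none of these keys exist in vcluster today), such a configuration could look something like:

```yaml
# Hypothetical configuration - these keys do not exist in vcluster.
# It only illustrates what the syncer would have to know: the API server
# endpoint and credentials for every host cluster it should sync to.
hostClusters:
  - name: dc1
    apiServer: https://k8s-dc1.example.com:6443   # placeholder endpoint
    kubeconfigSecret: vc-host-dc1                 # placeholder secret name
  - name: dc2
    apiServer: https://k8s-dc2.example.com:6443
    kubeconfigSecret: vc-host-dc2
  - name: dc3
    apiServer: https://k8s-dc3.example.com:6443
    kubeconfigSecret: vc-host-dc3
```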

Which alternative solutions exist?

No response

Additional context

No response

olljanat avatar Apr 19 '22 12:04 olljanat

@olljanat thanks for creating this issue! This is also something we have thought about and it's definitely possible, although it would require quite some rewriting of vcluster's code (but doable). One of the biggest difficulties right now is that vcluster would require a global network across the multiple host clusters; however, this could be achieved through submariner. This would essentially be the requirement for running vclusters across multiple host clusters, as otherwise networking would not work as expected. Besides that, persistent storage could get problematic, as storage would only be available in certain host clusters.

While this is a super exciting feature from a technical point of view, I'm still not sure how useful this feature would actually be in reality. vcluster would still require a super-host cluster, as submariner requires one too, which essentially is the single point of failure cluster again. Besides that, it's definitely interesting in terms of workload distribution, especially with our new feature where you can run a scheduler inside the vcluster, which you could use to schedule workloads across different clusters automatically through regular Kubernetes affinities and topologies. So it's definitely worth investigating further.
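To illustrate that last point: assuming the synced nodes carried a label identifying their host cluster (the label key below is made up), a workload inside the vcluster could be spread across host clusters with ordinary scheduling constraints:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      # Spread replicas evenly across host clusters. The label key
      # "example.com/host-cluster" is hypothetical - the syncer would
      # have to attach such a label to the synced nodes.
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: example.com/host-cluster
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: demo-app
      containers:
        - name: demo-app
          image: nginx:1.25
```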

FabianKramm avatar Apr 20 '22 09:04 FabianKramm

One of the biggest difficulties right now is that vcluster would require a global network across the multiple host clusters; however, this could be achieved through submariner.

Alternatively, you can also use Calico and do BGP peering with top-of-rack switches. With a configuration where you also advertise the Kubernetes service IP addresses and disable outgoing NAT on the IP pools, you can make all pod and service IPs reachable from outside the Kubernetes cluster as well (this is the configuration we are using).
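A minimal sketch of that kind of Calico setup (the peer address, AS numbers and CIDRs are placeholders, not our actual values):

```yaml
apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  asNumber: 64512              # placeholder AS number for the cluster
  # advertise the Kubernetes service CIDR in addition to pod routes
  serviceClusterIPs:
    - cidr: 10.96.0.0/16
---
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: tor-switch
spec:
  peerIP: 192.0.2.1            # placeholder top-of-rack switch address
  asNumber: 64513              # placeholder peer AS number
---
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: default-ipv4-ippool
spec:
  cidr: 10.244.0.0/16          # placeholder pod CIDR
  natOutgoing: false           # keep pod IPs routable outside the cluster
  ipipMode: Never
  vxlanMode: Never
```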

Nowadays that configuration is most commonly used on-prem, but it looks like Azure, AWS and GCP support BGP peering with other devices/processes, so it should be doable there too.

Besides that, persistent storage could get problematic, as storage would only be available in certain host clusters.

Sure, but that is actually the only way to make sure the storage system is not a single point of failure. What you want to do instead is let the application cluster (e.g. etcd, Redis, RabbitMQ, etc...) make sure that there are multiple copies of the data, written to those different storage systems.

vcluster would still require a super-host cluster, as submariner requires one too, which essentially is the single point of failure cluster again.

A super-host cluster shouldn't be needed as long as there is an odd number of syncer processes and each of them runs on a different host cluster. Then the vcluster can work as long as more than half of them are running (e.g. 2/3 or 3/5) and they can see each other. Also, if for some reason the connectivity between all host clusters went down, that shouldn't prevent any existing pods from running there. It only means that the vcluster must stay in read-only mode until the connectivity between the syncer processes works again.

olljanat avatar Apr 20 '22 10:04 olljanat

FYI. To simplify developing this and later verifying it in e2e tests (if it gets implemented), I have now created scripts which can be used to spin up three kind+calico clusters locally and set up BGP peering between them. The scripts can be found at https://github.com/olljanat/vcluster/tree/d7344790bb85d6d8b0bf86f2b4fc119804376499/hack/multi-cluster; just copy them locally and run ./start.sh

EDIT: I also tried to enable service IP advertisement in a later version of that code. However, it looks like it runs into problems if the same service CIDR is used on all host clusters. So to get this working we need to use a different one on each cluster, and we need to be able to handle that situation somehow.
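A kind config along these lines would give each cluster its own pod and service subnet (the concrete values here are just examples, not necessarily what the linked scripts use):

```yaml
# One of the three local clusters; the other two would use
# non-overlapping podSubnet/serviceSubnet values.
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  disableDefaultCNI: true       # Calico is installed instead of kindnet
  podSubnet: 10.244.0.0/16      # example only
  serviceSubnet: 10.96.0.0/16   # must differ per cluster once service IPs are advertised
nodes:
  - role: control-plane
  - role: worker
```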

olljanat avatar Apr 20 '22 19:04 olljanat

Another possible idea for this problem would be to use vcluster alongside liqo, which essentially allows using nodes to schedule workloads onto other clusters. Combined with vcluster this would mean we could schedule vcluster workloads, as well as parts of the vcluster itself, onto those nodes and essentially enable multi-cluster functionality.

We need some more investigation around this topic, but it wouldn't require any changes in vcluster itself to enable multi-cluster capabilities.

FabianKramm avatar May 26 '22 13:05 FabianKramm

Yes, from a quick look I really like the liqo architecture, especially the parts that allow mixing clusters with different CNIs and on-prem + managed clusters. Definitely worth investigating more.

olljanat avatar May 26 '22 14:05 olljanat

I think that we need to wait for the 0.5 release mentioned here https://github.com/liqotech/liqo#roadmap as vcluster will need this feature which is targeted for it:

Introduce the support to offload applications that need to contact the home API Server (e.g. operators and some db applications).

https://github.com/liqotech/liqo/issues/1185

Other things which I'm not too sure about are the following:

  1. The whole remote cluster is shown as a single node in the local cluster (created https://github.com/liqotech/liqo/issues/1249 about it)
  2. The namespace view is different depending on which cluster you are looking at (EDIT: that can probably be solved by using the NamespaceMappingStrategy=EnforceSameName setting)

olljanat avatar May 29 '22 16:05 olljanat

Hi! Liqo maintainer here :-)

I have to admit we haven't yet managed to investigate the combination of vcluster and liqo (we hope to be able to give it a try soon), but we feel it could definitely be interesting.

I think that we need to wait for the 0.5 release mentioned here https://github.com/liqotech/liqo#roadmap as vcluster will need this feature which is targeted for it:

Introduce the support to offload applications that need to contact the home API Server (e.g. operators and some db applications).

liqotech/liqo#1185

A first implementation of the feature required to support offloaded applications that need to interact with the home API server has already been merged into master, and will be included in the next release, which is planned in one to two weeks (although it won't include the other items of the roadmap). There are still some limitations (mainly, it does not support the TokenRequest API), but it should work in most situations.

Other things which I'm not too sure about are the following:

1. The whole remote cluster is shown as a single node in the local cluster (created [[Feature] Support foreign cluster nodes with their real names liqotech/liqo#1249](https://github.com/liqotech/liqo/issues/1249) about it)

It is unclear to me whether this is a strong requirement to make things work, or whether it would just enable different optimizations. Nonetheless, we see the reasons behind the proposal, and we are discussing whether we could/should add support for it alongside the current approach.

2. The namespace view is different depending on which cluster you are looking at (EDIT: that can probably be solved by using the [NamespaceMappingStrategy=EnforceSameName](https://doc.liqo.io/usage/namespace_offloading/) setting)

Yes, the EnforceSameName strategy ensures that remote namespaces have the same name as the corresponding one in the local cluster, which is typically a requirement to make cross-namespace DNS resolution work out of the box. All other resources are replicated with the same name in the remote namespace, and should lead to no concerns.
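For reference, per the linked documentation, enabling that strategy for a namespace looks roughly like this (the namespace name is a placeholder):

```yaml
apiVersion: offloading.liqo.io/v1alpha1
kind: NamespaceOffloading
metadata:
  name: offloading            # the docs use this fixed resource name
  namespace: my-namespace     # placeholder: the namespace to offload
spec:
  namespaceMappingStrategy: EnforceSameName
  podOffloadingStrategy: LocalAndRemote
```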

giorio94 avatar May 30 '22 08:05 giorio94