[Feature] offloading to nodes with specific labels
Is your feature request related to a problem? Please describe.
I've walked through the examples and documents, and would like to use Liqo for our team's multicluster management.
Our use-case & situation can be summarized as below:
- we're behind a VPN
- we run an on-premise k8s cluster and an EKS cluster
- we offload the on-premise cluster's namespaces to the EKS cluster
  - namespace `eks-gpu`: offloads to EKS nodes with the `nvidia.com/gpu=true` label
  - namespace `eks-spot`: offloads to EKS nodes with the `capacity=spot` label
Describe the solution you'd like
Currently, it seems that the created virtual node summarizes all nodes in the remote cluster.
I suggest supporting `nodeSelector` labels when offloading, so that the virtual node reflects only the nodes matching the selector. Injecting a `nodeSelector` term into the offloaded pods would also be useful (but that's OK to be done outside of Liqo).
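For example, the injection I have in mind would boil down to something like the sketch below (this is hypothetical, not Liqo code; the `remote` client and the function are just for illustration):

```go
package offload

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// offloadWithSelector copies a pod to the remote (EKS) cluster, constraining it
// to nodes carrying the given labels (e.g. nvidia.com/gpu=true or capacity=spot).
func offloadWithSelector(ctx context.Context, remote kubernetes.Interface, pod *corev1.Pod, selector map[string]string) error {
	remotePod := pod.DeepCopy()
	remotePod.ResourceVersion = "" // the copy is created as a new object on the remote cluster
	if remotePod.Spec.NodeSelector == nil {
		remotePod.Spec.NodeSelector = map[string]string{}
	}
	for k, v := range selector {
		remotePod.Spec.NodeSelector[k] = v
	}
	_, err := remote.CoreV1().Pods(remotePod.Namespace).Create(ctx, remotePod, metav1.CreateOptions{})
	return err
}
```

With something like this, pods from `eks-gpu` would land only on `nvidia.com/gpu=true` nodes, and pods from `eks-spot` only on `capacity=spot` nodes.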
Describe alternatives you've considered
I can write a mutating webhook in the EKS cluster to inject the `nodeSelector` term, but then the virtual node contains too much unnecessary (or even confusing) information: the virtual node reports enough resources (CPU, memory), yet pods don't schedule.
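For reference, the webhook alternative would essentially return a JSON patch like in the sketch below (again not Liqo code; the label values are placeholders):

```go
package webhook

import (
	"encoding/json"
	"fmt"

	admissionv1 "k8s.io/api/admission/v1"
	corev1 "k8s.io/api/core/v1"
)

// mutatePod builds an AdmissionResponse that patches the incoming pod with the
// desired nodeSelector (e.g. capacity=spot), merged with any existing selector.
func mutatePod(req *admissionv1.AdmissionRequest, selector map[string]string) (*admissionv1.AdmissionResponse, error) {
	var pod corev1.Pod
	if err := json.Unmarshal(req.Object.Raw, &pod); err != nil {
		return nil, fmt.Errorf("decoding pod: %w", err)
	}

	merged := map[string]string{}
	for k, v := range pod.Spec.NodeSelector {
		merged[k] = v
	}
	for k, v := range selector {
		merged[k] = v
	}

	patch := []map[string]interface{}{{
		"op":    "add",
		"path":  "/spec/nodeSelector",
		"value": merged,
	}}
	patchBytes, err := json.Marshal(patch)
	if err != nil {
		return nil, err
	}

	patchType := admissionv1.PatchTypeJSONPatch
	return &admissionv1.AdmissionResponse{
		UID:       req.UID,
		Allowed:   true,
		Patch:     patchBytes,
		PatchType: &patchType,
	}, nil
}
```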
Additional context
This feature can also help in multi-tenant scenarios, where you might not want to dedicate a cluster to every offloaded namespace:
- run multiple namespaces and node groups in the EKS cluster
- tenant1 gets the `tenant1` namespace (on-premise) and the `tenant1-eks` namespace (offloading)
- tenant2 gets the `tenant2` namespace (on-premise) and the `tenant2-eks` namespace (offloading)
- tenant1 cannot schedule pods on, nor get information about, tenant2's nodes
In https://github.com/liqotech/liqo/issues/1249, @giorio94 suggested creating a local shadow node for each remote node. A `nodeSelector` feature would help that scenario too.
Can this be done here?
https://github.com/liqotech/liqo/blob/master/cmd/virtual-kubelet/root/root.go#L145
nodeRunner, err := node.NewNodeController(
nodeProvider, nodeProvider.GetNode(),
localClient.CoreV1().Nodes(), // add nodeselector label here
node.WithNodeEnableLeaseV1(localClient.CoordinationV1().Leases(corev1.NamespaceNodeLease), int32(c.NodeLeaseDuration.Seconds())),
...
Hi @DevSusu,
If I understand it correctly, you would like to specify a node selector to offer only a subset of the resources available in the provider cluster (i.e., those associated with the nodes matching the selector). This feature makes sense to me (and also relates to excluding the tainted control plane nodes from the computation); it would require some modifications in the computation logic and in the shadow pod controller, to inject the given node selectors for offloaded pods. I cannot give you any timeline for this right now, but I'll add it to our roadmap for the future. If you would like to contribute, I can also give you more information about where to extend the logic to introduce it.
As for the piece of code you mentioned, that is the controller which deals with the creation of the virtual node. Still, the amount of resources associated with that node is taken from the `ResourceOffer`, which is created by the provider cluster through the above-mentioned computation logic and then propagated to the consumer cluster. Hence, you cannot use that to tune the amount of resources.
@giorio94 thanks for the prompt response, I really appreciate it
> you would like to specify a node selector to offer only a subset of the resources available in the provider cluster

thanks for the summary 😄
I would like to contribute; it'd be great if you could give some starting points!
Nice to hear that! In the following you can find some additional pointers:
- the local-resource-monitor is the component which includes the logic to compute the amount of resources to be offered to remote clusters. It includes two informers, one for the nodes and the other for the pods, and continuously keeps track of the amount of free resources. The easy part here is filtering out the nodes that do not match a given label selector. The trickier part is filtering out pods hosted by excluded nodes, since you might get the notification for a pod before the one for its hosting node, and you need to somehow cache that information until you know whether the node matches the selector (a rough sketch follows this list).
- the shadow-pod-controller is the controller that creates remote pods starting from ShadowPod resources (an abstraction we use for increased reliability). It should be fairly simple to enforce the node selector there.
- here you can find the logic to create a node selector from the parameters of a command, which might also be useful in this case.
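To make the first point more concrete, here is a very rough sketch of the filtering/caching idea (this is not the actual Liqo code; the `Monitor` type and its fields are just made up for illustration):

```go
package monitor

import (
	"sync"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// Monitor tracks resources only for nodes matching a label selector. Pods seen
// before their hosting node are parked in `pending` until the node shows up.
type Monitor struct {
	mu       sync.Mutex
	selector labels.Selector
	matching map[string]bool          // node name -> matches the selector
	pending  map[string][]*corev1.Pod // node name -> pods seen before the node
}

func Start(client kubernetes.Interface, selector labels.Selector, stop <-chan struct{}) *Monitor {
	m := &Monitor{selector: selector, matching: map[string]bool{}, pending: map[string][]*corev1.Pod{}}
	factory := informers.NewSharedInformerFactory(client, 0)

	factory.Core().V1().Nodes().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			node := obj.(*corev1.Node)
			m.mu.Lock()
			defer m.mu.Unlock()
			matches := m.selector.Matches(labels.Set(node.Labels))
			m.matching[node.Name] = matches
			if matches {
				// Flush the pods that were observed before their node.
				for _, pod := range m.pending[node.Name] {
					m.account(pod)
				}
			}
			delete(m.pending, node.Name)
		},
		// Update/Delete handling (including node label changes) is elided here.
	})

	factory.Core().V1().Pods().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			pod := obj.(*corev1.Pod)
			m.mu.Lock()
			defer m.mu.Unlock()
			matches, known := m.matching[pod.Spec.NodeName]
			switch {
			case !known && pod.Spec.NodeName != "":
				// Node not observed yet: park the pod until we know whether it matches.
				m.pending[pod.Spec.NodeName] = append(m.pending[pod.Spec.NodeName], pod)
			case matches:
				m.account(pod)
			}
		},
	})

	factory.Start(stop)
	return m
}

// account would subtract the pod's requests from the offered resources (elided).
func (m *Monitor) account(pod *corev1.Pod) {}
```

A real implementation would also need to handle pod and node updates/deletions, and to re-evaluate a node whenever its labels change.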
Feel free to ask for any further information.
@giorio94 , thanks for the pointers
I've skimmed through, and have a question/suggestion
Instead of caching the pod info, what about managing pod informers per node? When the node informer reports that a new node has been added, register a pod informer with a `nodeName` field selector (and vice versa when the node is deleted). So: one node informer with the label selector, and one pod informer per node.

This way, we don't need to worry about the timing issue you mentioned. Caching the pod info requires some guessing about how long to wait for the node info to come in, and that delay would also affect the virtual node's resource update period.
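Roughly what I have in mind, sketched with plain client-go (the function name and the surrounding wiring are hypothetical, not existing Liqo code):

```go
package monitor

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// startPodInformerForNode watches only the pods scheduled on a single node,
// via a spec.nodeName field selector, so pods on excluded nodes are never listed.
func startPodInformerForNode(client kubernetes.Interface, nodeName string, stop <-chan struct{}) cache.SharedIndexInformer {
	factory := informers.NewSharedInformerFactoryWithOptions(client, 0,
		informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
			opts.FieldSelector = fmt.Sprintf("spec.nodeName=%s", nodeName)
		}))

	podInformer := factory.Core().V1().Pods().Informer()
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			pod := obj.(*corev1.Pod)
			// The pod's requests would be subtracted from the resources
			// offered for nodeName (bookkeeping elided).
			_ = pod
		},
	})

	factory.Start(stop)
	return podInformer
}
```

In practice each node would get its own stop channel, so that its pod informer can be torn down when the node informer reports the node's deletion.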
> Instead of caching the pod info, what about managing pod informers per node? When the node informer reports that a new node has been added, register a pod informer with a `nodeName` field selector (and vice versa when the node is deleted). So: one node informer with the label selector, and one pod informer per node.
To me, the approach you propose definitely makes sense, and it also reduces the number of pods observed by the informers in case most nodes are excluded by the label selector.
As for the caching one, you could avoid the guessing through a refactoring of the data structure towards a more node-oriented approach (i.e., storing the resources used by each peered cluster per physical node, rather than as a whole), and then marking whether a given node shall be included or excluded. This would also cover the case in which node labels are modified, changing whether the node matches the selector.
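Just to sketch what I mean by node-oriented bookkeeping (field names invented for illustration, not existing Liqo types):

```go
package monitor

import corev1 "k8s.io/api/core/v1"

// perNodeUsage sketches node-oriented bookkeeping: resources are tracked per
// physical node and per peered cluster, and a node can flip between included
// and excluded when its labels change without losing the per-cluster usage.
type perNodeUsage struct {
	Allocatable corev1.ResourceList            // node allocatable resources
	UsedBy      map[string]corev1.ResourceList // peered cluster ID -> consumed resources
	Included    bool                           // does the node currently match the selector?
}

// resourceCache holds one entry per physical node; the offered amount is the
// sum over the entries with Included == true.
type resourceCache map[string]perNodeUsage
```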
I personally have no particular preference. I feel your proposal is a bit cleaner, although it also requires some more work/refactoring to integrate with the existing code (the data structure changes are probably needed nonetheless, to account for cleanup when a node is removed).