liqo
liqoctl unpeer doesn't remove broken peering connection
What happened:
If I try to create a peering between two clusters and it fails for some reason (e.g. a firewall rule prevents communication with the liqo-auth service, or liqo-auth and liqo-gateway are exposed as LoadBalancer services through a layer 2 MetalLB without being directly reachable), two custom resources are left behind, of kinds tunnelendpoints.net.liqo.io and networkconfigs.net.liqo.io respectively. These two resources are not deleted by "liqoctl unpeer".
What you expected to happen:
I expect liqoctl unpeer to also clean up a broken Liqo peering, or at least liqoctl uninstall not to start.
How to reproduce it (as minimally and precisely as possible):
I know the steps listed below are wrong and lead to a broken liqo installation!
- Create the kind clusters using the quick-start example.
- Install MetalLB on each cluster with two different ip pools (e.g. 10.96.100.0/24 and 10.104.100.0/24)
- Install liqo on each cluster, requiring that the liqo-auth and liqo-gateway services are deployed as LoadBalancer services (liqoctl install kind --cluster-name rome --set gateway.service.type=LoadBalancer --set auth.service.type=LoadBalancer).
- Try in-band peering (liqoctl peer in-band --remote-kubeconfig liqo_kubeconf_milan).
- Wait for the peering to fail. At this point two broken custom resources have been created (tunnelendpoints.net.liqo.io and networkconfigs.net.liqo.io).
- liqoctl unpeer milan reports success, but the resources are still there. If I then run liqoctl uninstall, it passes the uninstallation checks and then fails with a timeout.
Anything else we need to know?:
A quick workaround is just deleting the resources manually (kubectl delete tunnelendpoints.net.liqo.io -n <liqo-tenant-namespace> and kubectl delete networkconfigs.net.liqo.io -n <liqo-tenant-namespace>).
I repeat that I know these steps are not the correct ones for installing and configuring liqo, but in doing some trial and error I came across this situation that seemed anomalous.
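The manual workaround above could be scripted roughly as follows. This is only a sketch under stated assumptions: cleanup_tenant_ns and run are hypothetical helper names (not part of liqoctl), the namespace argument is whatever liqo-tenant-* namespace holds the leftovers, and --all is added on the assumption that every resource of those two kinds in that namespace should go, since kubectl delete expects either a resource name or --all. By default the script only prints the commands.

```shell
#!/bin/sh
# DRY_RUN=1 (the default here) only prints each kubectl command;
# set DRY_RUN=0 to actually execute them.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# cleanup_tenant_ns is a hypothetical helper, not a liqoctl feature.
# $1: the tenant namespace that holds the leftover resources.
cleanup_tenant_ns() {
  for kind in tunnelendpoints.net.liqo.io networkconfigs.net.liqo.io; do
    # Assumption: --all is added because kubectl delete expects
    # either a resource name or --all.
    run kubectl delete "$kind" --all -n "$1"
  done
}
```

After sourcing the file, calling cleanup_tenant_ns with the tenant namespace prints the two delete commands for review; rerunning with DRY_RUN=0 executes them.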
liqo-controller logs:
E0313 10:06:38.671955 1 foreign-cluster-controller.go:218] Failed to ensure identity for remote cluster "milan": failed to send identity request: Post "https://10.203.0.3:443/identity/certificate": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
I0313 10:06:38.674710 1 trace.go:219] Trace[2000615844]: "Reconcile" ForeignCluster:milan (13-Mar-2023 10:06:33.669) (total time: 5004ms):
Trace[2000615844]: ---"ForeignCluster status update" 5004ms (10:06:38.674)
Trace[2000615844]: [5.004700396s] [5.004700396s] END
liqoctl uninstall error:
ERRO Error uninstalling Liqo: timed out waiting for the condition
liqoctl unpeer milan result:
INFO Outgoing peering marked as disabled
INFO Successfully disabled outgoing peering to the remote cluster "milan"
liqoctl peer command error:
ERRO (local) Failed establishing networking to the remote cluster "milan": timed out waiting for the condition
Environment:
- Liqo version: 0.7.1
- Kubernetes version (use kubectl version): v1.26.2
- Cloud provider or hardware configuration: Kind / Kubeadm
- Network plugin and version: kindnet / flannel
- Install tools: liqoctl
- Others: None
Hi, I think I have almost the same problem as you. First, I peered two clusters successfully. The problem appeared when I tried to unpeer them.
liqoctl unpeer error:
ERRO Failed disabling outgoing peering to the remote cluster "milan": timed out waiting for the condition
Then I tried these steps (kubectl delete tunnelendpoints.net.liqo.io -n <liqo-tenant-namespace> and kubectl delete networkconfigs.net.liqo.io -n <liqo-tenant-namespace>) and peered the two clusters again.
liqoctl peer error:
ERRO Failed activating outgoing peering to the remote cluster "milan": timed out waiting for the condition
But the problem still exists. Do you know how to fix it? Please help me. 😭
Hi @Wangxinxinhappy, what I suggest is to remove the foreignclusters resources and the liqo-tenant-* namespaces (on both sides). If you run into problems deleting the namespaces, check that the resourceoffers in the tenant namespaces have been deleted (if not, remove their finalizers by hand).
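A rough sketch of that cleanup, to be repeated on both clusters. The helper names (run, clear_finalizers, reset_peering) are hypothetical, and so are the resource and namespace names you pass in. Clearing finalizers with a JSON merge patch is one common way to do it by hand and is an assumption here, not an official Liqo procedure. By default the commands are only printed.

```shell
#!/bin/sh
# DRY_RUN=1 (the default here) only prints each kubectl command;
# set DRY_RUN=0 to actually execute them.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    # Note: the printed form drops the shell quoting around the patch body.
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# clear_finalizers KIND NAME NAMESPACE (hypothetical helper)
# Sets metadata.finalizers to null so a stuck deletion can complete.
clear_finalizers() {
  run kubectl patch "$1" "$2" -n "$3" --type=merge \
    -p '{"metadata":{"finalizers":null}}'
}

# reset_peering FOREIGNCLUSTER TENANT_NAMESPACE (hypothetical helper)
# Deletes the foreignclusters resource and its tenant namespace.
reset_peering() {
  run kubectl delete foreignclusters "$1"
  run kubectl delete namespace "$2"
}
```

If the namespace deletion hangs, list the resourceoffers still present in it and call clear_finalizers on each one before retrying.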
Hi @cheina97, thanks for your reply!
I removed the foreignclusters resources and the liqo-tenant-* namespaces, and then reinstalled liqo on both sides. But liqoctl peer out-of-band nameless-brook * still failed:
INFO Peering enabled
INFO Authenticated to cluster "nameless-brook"
ERRO Failed activating outgoing peering to the remote cluster "nameless-brook": timed out waiting for the condition
Hi @cheina97, I see this issue is still present in the latest Liqo (v0.10.1) and happens when peering fails for any reason.
In my case though, the uninstall fails trying to remove tunnelendpoints.net.liqo.io. I saw it runs a liqo-pre-delete job which tries to clean up everything, but it fails for this resource kind. I cannot delete this tunnelendpoint myself either: the deletion just times out. I also cannot edit it to remove the finalizers:
finalizers:
- liqo-gateway.net.liqo.io
- liqo-route.10.5.0.5.net.liqo.io
kubectl throws back at me:
error: tunnelendpoints.net.liqo.io "misty-thunder-690265" could not be found on the server
The edits you made on deleted resources have been saved to "/tmp/kubectl-edit-1165863843.yaml"
Yet it is very much found and blocking the uninstall:
NAME                   PEERING CLUSTER   BACKEND TYPE   CONNECTION STATUS   AGE
misty-thunder-690265   misty-thunder     wireguard      Connected           2d3h
FWIW, its namespace was already deleted...
Workaround
All right, I recreated the namespace and then I was able to remove the finalizers (and had to also delete the invalid ownerReferences) and delete it. Oh boy, it seems the unpeer + uninstall create some nice confusion together! 😅
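The steps described above might look roughly like this as a script. It is a sketch, not the exact commands that were run: force_delete_tunnelendpoint and run are hypothetical helper names, the namespace to recreate is not given in this thread and must be supplied by you, and clearing both fields with a single JSON merge patch is an assumption about how to do it. By default the commands are only printed.

```shell
#!/bin/sh
# DRY_RUN=1 (the default here) only prints each kubectl command;
# set DRY_RUN=0 to actually execute them.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    # Note: the printed form drops the shell quoting around the patch body.
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# force_delete_tunnelendpoint NAME NAMESPACE (hypothetical helper)
force_delete_tunnelendpoint() {
  # 1. Recreate the namespace that was deleted too early.
  run kubectl create namespace "$2"
  # 2. Drop the finalizers and the invalid ownerReferences in one merge patch.
  run kubectl patch tunnelendpoints.net.liqo.io "$1" -n "$2" --type=merge \
    -p '{"metadata":{"finalizers":null,"ownerReferences":null}}'
  # 3. Delete the resource; if a deletion was already pending, step 2 may
  #    have completed it, in which case this reports NotFound.
  run kubectl delete tunnelendpoints.net.liqo.io "$1" -n "$2"
}
```

Once the tunnelendpoint is gone, the recreated namespace can be deleted again and the uninstall retried.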
Hi @yoctozepto, we know that peering is one of the most problematic parts of Liqo. This year we are working on making Liqo modular, and one of the objectives is to remove the current peering mechanism and replace it with a clean, declarative approach.
I'm sorry for your issue and happy you found a solution.
In the coming months we are going to release the new Liqo network, which replaces the current one. It will be independent from the rest of Liqo and will solve problems like this one. Stay tuned.
Thanks for the summary @cheina97 and no need to be sorry! It works very fine so far except for these quirks. I have seen these various improvements being mentioned around the issues I happened to see when filtering for relevant ones. Do you have a central place where you track these design decisions and the related work? Keeping my fingers crossed and looking forward to seeing this future liqo!
There is no public design document for network modularity on GitHub at the moment. In the future we will share more insight into the design of the next modularity steps.