[Backport v2.6] [BUG] rke2/k3s proxied custom cluster unable to finish rancher-system-agent with ` Main process exited, code=exited, status=2/INVALIDARGUMENT`
This is a backport issue for https://github.com/rancher/rancher/issues/39066, automatically created via rancherbot by @sowmyav27
Original issue description:
Rancher Server Setup
- Rancher version: v2.6.9-rc1
- Installation option (Docker install/Helm Chart): docker
- If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc): n/a
- Proxy/Cert Details: rancher-signed
Information about the Cluster
- Kubernetes version: 1.24.4+rke2r1 and 1.24.4+k3s1
- Cluster Type (Local/Downstream): downstream
- If downstream, what type of cluster? (Custom/Imported or specify provider for Hosted/Infrastructure Provider): custom
User Information
- What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom): standard user, cluster owner
Describe the bug
When deploying a custom cluster that uses a proxy, the initial connection between Rancher and the node succeeds; however, registration hangs and never finishes.
To Reproduce
- deploy a normal (public) rancher setup
- deploy a proxy in an environment that you can deploy private nodes to
- create a custom cluster in rancher that is configured to use the proxy
- run the rancher cmd on a private node with access to the proxy
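Before running the registration command, a quick pre-flight check from the private node can confirm the proxy path itself is sound. This is a sketch: `<proxy>` and `<server>` are the same placeholders used elsewhere in this report, and it assumes Rancher's standard `/ping` health endpoint.

```shell
# Verify the Rancher server is reachable *through* the proxy from the
# private node. -x forces the request through the proxy; -k skips TLS
# verification for this check only (the CA trust issue is the bug itself).
curl -x "<proxy>" -skf "https://<server>/ping" \
  && echo "proxy path to Rancher OK" \
  || echo "cannot reach Rancher through the proxy"
```

If this fails, the problem is in the proxy setup rather than in rancher-system-agent.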
Result
node hangs and does not finish registering
logs in rancher UI:
waiting for agent to checkin and apply plan
logs on the private node from rancher-system-agent:
panic: error while connecting to Kubernetes cluster with nullified CA data: Get "https://<IP>/version": x509: certificate signed by unknown authority
rancher-system-agent[8612]: goroutine 9 [running]:
rancher-system-agent[8612]: github.com/rancher/system-agent/pkg/k8splan.(*watcher).start(0xc00028a200, {0x18bd5c0?, 0xc0001184c0})
rancher-system-agent[8612]: /go/src/github.com/rancher/system-agent/pkg/k8splan/watcher.go:99 +0x9b4
rancher-system-agent[8612]: created by github.com/rancher/system-agent/pkg/k8splan.Watch
rancher-system-agent[8612]: /go/src/github.com/rancher/system-agent/pkg/k8splan/watcher.go:63 +0x155
rancher-system-agent.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
systemd[1]: rancher-system-agent.service: Failed with result 'exit-code'.
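For anyone reproducing this, the stack trace above is easiest to capture by following the unit's journal while re-running registration. This is standard systemd tooling, nothing Rancher-specific:

```shell
# Follow the agent's logs live; the panic and the
# status=2/INVALIDARGUMENT exit show up here.
journalctl -u rancher-system-agent.service -f --no-pager

# Inspect the unit state and last exit code after the failure.
systemctl status rancher-system-agent.service --no-pager
```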
registration cmd used
HTTP_PROXY="<proxy>" HTTPS_PROXY="<proxy>" NO_PROXY="localhost,127.0.0.1,0.0.0.0,10.0.0.0/8,cattle-system.svc" curl --insecure -fL https://173.255.252.190/system-agent-install.sh | sudo HTTP_PROXY="<proxy>" HTTPS_PROXY="<proxy>" NO_PROXY="localhost,127.0.0.1,0.0.0.0,10.0.0.0/8,cattle-system.svc" sh -s - --server https://<server> --label 'cattle.io/os=linux' --token <token> --ca-checksum <checksum> --etcd --controlplane --worker
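Since the node-side error is an x509 trust failure, it can be worth recomputing the value the installer compares against `--ca-checksum`. A minimal sketch, assuming the standard `/cacerts` endpoint that Rancher's install scripts hash; `<server>` is a placeholder:

```shell
# Recompute the SHA-256 of the CA bundle served by Rancher; this should
# match the --ca-checksum passed to the registration command.
curl -skfL "https://<server>/cacerts" | sha256sum | awk '{print $1}'
```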
Expected Result
rke2/k3s should be able to provision when using a proxy
Additional context
- rke1 with the same settings will work
- multi-node doesn't have the same logs, but ends up in the same state
- rke2/k3s logs on the node appear to be the same issue
- custom cluster rke2/k3s without proxy settings on the same setup works fine
Is this a regression from v2.6.8 or was it always broken?
I believe it was working on 2.6.3, but I have not tested that yet. It is not working on 2.6.8, so this will go in 2.6.10 and not block the current release.
- system-agent PR: https://github.com/rancher/system-agent/pull/97
- wins PR: https://github.com/rancher/wins/pull/145
Reopening for now: on 2.6-head (3952135) I'm still seeing the same issue. It appears that system-agent is still on v0.2.10, and the downstream node still shows the error:
rancher-system-agent.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Moving back to test, https://github.com/rancher/system-agent/releases/tag/v0.2.11 was erroneously marked as pre-release.
Did more testing with v0.2.11; unfortunately, the same error occurs.
tested on v2.6-head (e139976) in both a public rancher and rancher behind proxy:
- provision rke2 cluster with default settings -- pass
- provision rke2 cluster using a proxy -- pass
- provision rke2 dedicated node-per-role cluster using a proxy -- pass
- provision k3s cluster with default settings -- pass
- provision k3s cluster using a proxy -- pass
- provision k3s dedicated node-per-role cluster using a proxy -- pass
- basic network checks on each cluster -- pass

note: k3s v1.24.4+k3s1 has a known issue, but v1.24.6+k3s1 did provision when using the proxy