[Backport v2.6] [BUG] rke2/k3s proxied custom cluster unable to finish rancher-system-agent with `Main process exited, code=exited, status=2/INVALIDARGUMENT`

Open rancherbot opened this issue 3 years ago • 2 comments

This is a backport issue for https://github.com/rancher/rancher/issues/39066, automatically created via rancherbot by @sowmyav27

Original issue description:

Rancher Server Setup

  • Rancher version: v2.6.9-rc1
  • Installation option (Docker install/Helm Chart): docker
    • If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc): n/a
  • Proxy/Cert Details: rancher-signed

Information about the Cluster

  • Kubernetes version: 1.24.4+rke2r1 and 1.24.4+k3s1
  • Cluster Type (Local/Downstream): downstream
    • If downstream, what type of cluster? (Custom/Imported or specify provider for Hosted/Infrastructure Provider): custom

User Information

  • What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom): standard user, cluster owner

Describe the bug

When deploying a custom cluster that uses a proxy, the initial connection between Rancher and the node is made, but registration hangs and never finishes.

To Reproduce

  • deploy a normal (public) rancher setup
  • deploy a proxy in an environment that you can deploy private nodes to
  • create a custom cluster in rancher that is configured to use the proxy
  • run the rancher cmd on a private node with access to the proxy

Result

node hangs and does not finish registering

logs in rancher UI: waiting for agent to checkin and apply plan

logs on the private node from rancher-system-agent:

panic: error while connecting to Kubernetes cluster with nullified CA data: Get "https://<IP>/version": x509: certificate signed by unknown authority
rancher-system-agent[8612]: goroutine 9 [running]:
rancher-system-agent[8612]: github.com/rancher/system-agent/pkg/k8splan.(*watcher).start(0xc00028a200, {0x18bd5c0?, 0xc0001184c0})
rancher-system-agent[8612]:         /go/src/github.com/rancher/system-agent/pkg/k8splan/watcher.go:99 +0x9b4
rancher-system-agent[8612]: created by github.com/rancher/system-agent/pkg/k8splan.Watch
rancher-system-agent[8612]:         /go/src/github.com/rancher/system-agent/pkg/k8splan/watcher.go:63 +0x155
rancher-system-agent.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
systemd[1]: rancher-system-agent.service: Failed with result 'exit-code'.

registration cmd used

HTTP_PROXY="<proxy>" HTTPS_PROXY="<proxy>" NO_PROXY="localhost,127.0.0.1,0.0.0.0,10.0.0.0/8,cattle-system.svc" curl --insecure -fL https://173.255.252.190/system-agent-install.sh | sudo HTTP_PROXY="<proxy>" HTTPS_PROXY="<proxy>" NO_PROXY="localhost,127.0.0.1,0.0.0.0,10.0.0.0/8,cattle-system.svc" sh -s - --server https://<server> --label 'cattle.io/os=linux' --token <token> --ca-checksum <checksum> --etcd --controlplane --worker
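When debugging a setup like this, it can help to check which destinations the NO_PROXY list above would exempt from the proxy. The following is a minimal sketch in Python, not the logic used by curl or rancher-system-agent. The function name `bypasses_proxy` and the matching rules (exact IP, CIDR range, hostname or subdomain suffix) are assumptions for illustration; real clients differ in details such as CIDR support.

```python
import ipaddress

# NO_PROXY value copied from the registration command in this issue.
NO_PROXY = "localhost,127.0.0.1,0.0.0.0,10.0.0.0/8,cattle-system.svc"


def bypasses_proxy(host: str, no_proxy: str) -> bool:
    """Return True if `host` matches an entry in a comma-separated NO_PROXY list.

    Hypothetical helper: approximates common NO_PROXY semantics only.
    """
    host = host.strip().lower().rstrip(".")
    try:
        host_ip = ipaddress.ip_address(host)
    except ValueError:
        host_ip = None  # not an IP literal, so treat it as a hostname

    for raw in no_proxy.split(","):
        entry = raw.strip().lower()
        if not entry:
            continue
        if entry == "*":
            return True  # wildcard: nothing goes through the proxy
        if host_ip is not None:
            # IP hosts match exact IP entries or CIDR ranges.
            try:
                if host_ip in ipaddress.ip_network(entry, strict=False):
                    return True
            except ValueError:
                pass  # entry is a hostname and cannot match an IP host
            continue
        # Hostnames match the entry itself or any subdomain of it.
        suffix = entry.lstrip(".")
        if host == suffix or host.endswith("." + suffix):
            return True
    return False


if __name__ == "__main__":
    for target in ("10.42.0.7", "173.255.252.190", "rancher.cattle-system.svc"):
        print(target, bypasses_proxy(target, NO_PROXY))
```

Under these assumed rules, the address used in the curl command (173.255.252.190) matches no entry in the list, so that request would be sent through the proxy, while 10.0.0.0/8 addresses and cattle-system.svc names would bypass it.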

Expected Result

rke2/k3s should be able to provision when using a proxy

Screenshots

Additional context

  • rke1 with the same settings will work
  • multi-node doesn't have the same logs, but ends up in the same state
  • rke2/k3s logs on the node appear to be the same issue
  • custom cluster rke2/k3s without proxy settings on the same setup works fine

rancherbot avatar Sep 21 '22 19:09 rancherbot

Is this a regression from v2.6.8 or was it always broken?

Oats87 avatar Sep 21 '22 19:09 Oats87

I believe it was working on 2.6.3, but I have not tested that yet. It is not working on 2.6.8, so this will go in 2.6.10 and not block the current release.

slickwarren avatar Sep 21 '22 22:09 slickwarren

system-agent PR: https://github.com/rancher/system-agent/pull/97

wins PR: https://github.com/rancher/wins/pull/145

jakefhyde avatar Sep 29 '22 20:09 jakefhyde

reopening for now, as on 2.6-head (3952135) I'm still seeing the same issue. It appears that system-agent is still on v0.2.10, and the downstream node still has the error:

rancher-system-agent.service: Main process exited, code=exited, status=2/INVALIDARGUMENT

slickwarren avatar Sep 29 '22 21:09 slickwarren

Moving back to test, https://github.com/rancher/system-agent/releases/tag/v0.2.11 was erroneously marked as pre-release.

jakefhyde avatar Sep 29 '22 22:09 jakefhyde

Did more testing with v0.2.11; unfortunately, the same error occurs.

slickwarren avatar Sep 30 '22 17:09 slickwarren

tested on v2.6-head (e139976) in both a public rancher and rancher behind proxy:

  • provision rke2 cluster with default settings -- pass
  • provision rke2 cluster using a proxy -- pass
  • provision rke2 dedicated node-per-role cluster using a proxy -- pass
  • provision k3s cluster with default settings -- pass
  • provision k3s cluster using a proxy -- pass
  • provision k3s dedicated node-per-role cluster using a proxy -- pass

also performed basic network checks on each cluster -- pass

note: k3s v1.24.4+k3s1 has a known issue, but using v1.24.6+k3s1 did provision when using the proxy

slickwarren avatar Oct 12 '22 19:10 slickwarren