nodePlan does not get regenerated after changing private registry configuration
What kind of request is this (question/bug/enhancement/feature request): Bug
Steps to reproduce (least amount of steps as possible):
Cluster with AWS node pools:
1. Set up a private registry (with an incorrect URL).
2. Remove the private registry configuration.
3. cattle-node-agent on the master nodes stops working with the following error:
time="2021-01-08T13:19:30Z" level=info msg="Connecting to proxy" url="wss://rancher.prod.techops/v3/connect" time="2021-01-08T13:19:30Z" level=debug msg="Get agent config: &rkeworker.NodeConfig{ClusterName:"c-rmmb9", Certs:"", Processes:map[string]v3.Process{"share-mnt":v3.Process{Name:"share-mnt", Command:[]string(nil), Args:[]string{"--", "share-root.sh", "docker run -d --privileged --restart=unless-stopped --net=host -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run https:/registry-cache.ingress.usw2/rancher/rancher-agent:v2.4.5 --server https://rancher.prod.techops.usw2 --token fq8njpbgl4t...x5c4ztgklrpt7w...9sldp8s --ca-checksum 0c4b906e4b24789a033e40dfbad2702a1524863cff01a09710aa7f8674480804 --no-register --only-write-certs --node-name rancher-control-usw2a-1", "/var/lib/kubelet", "/var/lib/rancher"}, Env:[]string(nil), Image:"https:/registry-cache.ingress.usw2.upwork/rancher/rancher-agent:v2.4.5", ImageRegistryAuthConfig:"", VolumesFrom:[]string(nil), Binds:[]string{"/var/run:/var/run"}, NetworkMode:"host", RestartPolicy:"always", PidMode:"host", Privileged:true, HealthCheck:v3.HealthCheck{URL:""}, Labels:map[string]string(nil), Publish:[]string(nil), User:""}}, Files:[]v3.File(nil), NodeVersion:0, AgentCheckInterval:120}" time="2021-01-08T13:19:30Z" level=error msg="Remotedialer proxy error" error="Error response from daemon: invalid reference format"
Note: this problem only affects control plane nodes (all of them).
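For context, the last log line is Docker rejecting the generated image reference because the URL scheme was left prepended to the registry host. A quick way to see the same failure locally, using a hypothetical registry host (not the one from the log):
docker pull https:/registry.example.com/rancher/rancher-agent:v2.4.5
# fails with "invalid reference format" – the scheme makes the image reference invalid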
Result:
The Kubernetes cluster is up and running but Rancher can't talk to the master nodes. Rancher shows the message "Failed to reconcile etcd plane: Etcd plane nodes are replaced. Stopping provisioning. Please restore your cluster from backup."
The node API shows the following info:
nodePlan": { "agentCheckInterval": 120, "plan": { "processes": { "share-mnt": { "args": [ 5 items "--", "share-root.sh", "docker run -d --privileged --restart=unless-stopped --net=host -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run https:/registry-cache.ingress.usw2/rancher/rancher-agent:v2.4.5 --server https://rancher.prod.techops... --token fq8njpbgl4tw.....9sldp8s --ca-checksum 0c4b906e4b24789a033e40dfbad27....710aa7f8674480804 --no-register --only-write-certs --node-name rancher-control-usw2a-1", "/var/lib/kubelet", "/var/lib/rancher" ], "binds": [ "/var/run:/var/run" ], "healthCheck": { "type": "/v3/schemas/healthCheck" }, "image": "https:/registry-cache.ingress.usw2/rancher/rancher-agent:v2.4.5", "name": "share-mnt", "networkMode": "host", "pidMode": "host", "privileged": true, "restartPolicy": "always", "type": "/v3/schemas/process" } }, "type": "/v3/schemas/rkeConfigNodePlan" }, "type": "/v3/schemas/nodePlan", "version": 1 }, I suspect the cattle agent read the info from the API, so far i couldnt update API nodePlan info.
Any idea how I can update it?
Other details that may be helpful:
Environment information
- Rancher version (rancher/rancher or rancher/server image tag or shown bottom left in the UI): 2.4.5
- Installation option (single install/HA): HA
Cluster information
- Cluster type (Hosted/Infrastructure Provider/Custom/Imported): Node pools in AWS
- Machine type (cloud/VM/metal) and specifications (CPU/memory): m5.xlarge
- Kubernetes version (use kubectl version): v1.17.4
- Docker version (use docker version): Docker version 19.03.11, build 42e35e61f3
This needs to be edited in the cluster that Rancher is running on. First you need to retrieve the cluster ID and node/machine ID. The easiest way is to take them from the address bar in the browser when the node detail page is open. The cluster ID is in the form c-xxxxx (based on your log it is c-rmmb9) and the node/machine ID is in the form m-xxxxx (possibly machine-xxxxx). You can also list them with kubectl, as shown below.
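If you prefer the command line, listing the node objects in the cluster's namespace should show the machine IDs alongside the node names. A minimal sketch, assuming the cluster ID from your log and run against the cluster Rancher is deployed on:
kubectl get nodes.management.cattle.io -n c-rmmb9 -o custom-columns=ID:.metadata.name,NODE:.status.nodeName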
Make sure your kubeconfig is configured to talk to the cluster Rancher is deployed on. Replace $CLUSTERID and $NODEID with the retrieved values.
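To double-check that you are pointed at the Rancher management cluster and not at the downstream cluster, something like the following should help (the clusters.management.cattle.io resource only exists on the cluster Rancher runs on):
kubectl config current-context
kubectl get clusters.management.cattle.io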
- Check if you are querying the correct node (this should return the name of the node)
kubectl get nodes.management.cattle.io -n $CLUSTERID $NODEID -o jsonpath='{.status.nodeName}{"\n"}'
- Confirm the nodePlan contains the wrong registry
kubectl get nodes.management.cattle.io -n $CLUSTERID $NODEID -o jsonpath='{.status.nodePlan}'
- Remove the nodePlan, so it can be regenerated with the correct info
kubectl patch nodes.management.cattle.io -n $CLUSTERID $NODEID --type=json -p='[{"op": "remove", "path": "/status/nodePlan"}]'
- Verify the nodePlan is now correctly generated
kubectl get nodes.management.cattle.io -n $CLUSTERID $NODEID -o jsonpath='{.status.nodePlan}'
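Optionally, you can also confirm that the regenerated plan no longer carries the https:/ prefix on the image. A sketch using the share-mnt process from your output above (bracket notation because of the dash in the key name):
kubectl get nodes.management.cattle.io -n $CLUSTERID $NODEID -o jsonpath="{.status.nodePlan.plan.processes['share-mnt'].image}"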
Much appreciated, this worked well. Thank you!