[BUG] Rancher can no longer provision harvester machines after restart
Rancher Server Setup
- Rancher version: v2.8.0
- Installation option (Docker install/Helm Chart): as a helm chart on a single-node k3s cluster
- Proxy/Cert Details:
Information about the Cluster
- Infrastructure Provider = Harvester
User Information
- What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom)
- Admin
Describe the bug
After one of my Harvester nodes was unexpectedly rebooted, Rancher is no longer able to provision machines in the upstream Harvester HCI infrastructure.
Trying to scale up an existing managed RKE2 cluster from Rancher fails with the following error:
machine Downloading driver from https://192.168.20.10/assets/docker-machine-driver-harvester
machine Doing /etc/rancher/ssl
machine docker-machine-driver-harvester
machine docker-machine-driver-harvester: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, stripped
machine Trying to access option which does not exist
machine THIS ***WILL*** CAUSE UNEXPECTED BEHAVIOR
machine Type assertion did not go smoothly to string for key
machine Running pre-create checks...
machine Error with pre-create check: "the server has asked for the client to provide credentials (get settings.harvesterhci.io server-version)"
machine The default lines below are for a sh/bash shell, you can specify the shell you're using, with the --shell flag.
And creating a brand new cluster has a different error:
machine Downloading driver from https://192.168.20.10/assets/docker-machine-driver-harvester
machine Doing /etc/rancher/ssl
machine docker-machine-driver-harvester
machine docker-machine-driver-harvester: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, stripped
machine error loading host testing-pool1-31b05da3-dlchl: Docker machine "testing-pool1-31b05da3-dlchl" does not exist. Use "docker-machine ls" to list machines. Use "docker-machine create" to add a new one.
Looks like the connection between Rancher and Harvester is broken?
Maybe related to #44929?
This seems to occur even after the fix for #44929, both on scaling and on creating a new cluster.
And I am on Rancher v2.8.2.
Looking at the created job (for a worker node scaleup):
"args": [ 8 items
"--driver-download-url=https://<host>/assets/docker-machine-driver-harvester",
"--driver-hash=a9c2847eff3234df6262973cf611a91c3926f3e558118fcd3f4197172eda3434",
"--secret-namespace=fleet-default",
"--secret-name=staging-pool-worker-bbfc2798-d5jsj-machine-state",
"rm",
"-y",
"--update-config",
"staging-pool-worker-bbfc2798-d5jsj"
The first thing the driver tries is to delete the non-existing machine, and that fails... I would expect a create instead. I just don't know where this command is generated.
I could manually fix it:
- go to the Harvester embedded Rancher and get the kubeconfig
- update the kubeconfig in the Harvester credential in the cattle-global-data namespace in the local cluster (running Rancher); they are probably named hv-cred
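For reference, a minimal sketch of that fix with kubectl, assuming the cloud credential secret is named cc-abc123 (hypothetical; list the secrets first to find yours) and the fresh kubeconfig from Harvester's embedded Rancher is saved as harvester-kubeconfig.yaml:

```bash
# Run against the local cluster where Rancher Manager is installed.

# Find the Harvester cloud credential secret (names look like cc-xxxxx):
kubectl -n cattle-global-data get secrets

# Replace the stored kubeconfig. The data key below is the one Harvester
# cloud credentials appear to use; verify it against your secret first
# (e.g. kubectl -n cattle-global-data get secret cc-abc123 -o yaml).
# Note: base64 -w0 is the GNU flag; on macOS use plain base64.
kubectl -n cattle-global-data patch secret cc-abc123 --type merge \
  -p "{\"data\":{\"harvestercredentialConfig-kubeconfigContent\":\"$(base64 -w0 < harvester-kubeconfig.yaml)\"}}"
```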
@bpedersen2 do you have Rancher running inside a nested VM or in the same Kubernetes cluster as Harvester itself?
Following the manual fix steps by getting the kubeconfig and manually updating the secret in Rancher worked for me!
@bpedersen2 do you have Rancher running inside a nested VM or in the same Kubernetes cluster as Harvester itself?
No, it is running standalone.
What I observe is that the token in Harvester changes.
Rancher is configured to use OIDC, and in the Rancher logs I get:
Error refreshing token principals, skipping: oauth2: "invalid_grant" "Token is not active"
2024/04/02 11:43:26 [ERROR] [keycloak oidc] GetPrincipal: error creating new http client: oauth2: "invalid_grant" "Token is not active"
2024/04/02 11:43:26 [ERROR] error syncing 'user-XXX': handler mgmt-auth-userattributes-controller: oauth2: "invalid_grant" "Token is not active", requeuing
With a local user, it seems to work.
I re-registered the Harvester cluster using a non-OIDC admin account, and now the connection seems to be stable again. It looks like a token expiration problem to me.
I have the same problem:
Failed creating server [fleet-default/rke2-rc-control-plane-2aae5bdf-2m48z] of kind (HarvesterMachine) for machine rke2-rc-control-plane-5b74797746x4dpcs-ncdxf in infrastructure provider: CreateError: Downloading driver from https://HOST/assets/docker-machine-driver-harvester Doing /etc/rancher/ssl docker-machine-driver-harvester docker-machine-driver-harvester: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, stripped Trying to access option which does not exist THIS ***WILL*** CAUSE UNEXPECTED BEHAVIOR Type assertion did not go smoothly to string for key Running pre-create checks... Error with pre-create check: "the server has asked for the client to provide credentials (get settings.harvesterhci.io server-version)" The default lines below are for a sh/bash shell, you can specify the shell you're using, with the --shell flag.
Rancher v2.8.2 Dashboard v2.8.0 Helm v2.16.8-rancher2 Machine v0.15.0-rancher106 Harvester: v1.2.1
I have been in a loop for many hours: a new VM is created, it errors, the VM is deleted, then a new VM is created, errors again, and is deleted again...
I could manually fix it:
- go to the Harvester embedded Rancher and get the kubeconfig
- update the kubeconfig in the Harvester credential in the cattle-global-data namespace in the local cluster (running Rancher); they are probably named hv-cred
OK, that worked for me. I have Rancher with users provided by Active Directory.
Now I have this error:
Failed deleting server [fleet-default/rke2-rc-control-plane-3fba9236-dxptf] of kind (HarvesterMachine) for machine rke2-rc-control-plane-77f9455c9dx9xgsk-4kcwf in infrastructure provider: DeleteError: Downloading driver from https://HOST/assets/docker-machine-driver-harvester Doing /etc/rancher/ssl docker-machine-driver-harvester docker-machine-driver-harvester: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, stripped About to remove rke2-rc-control-plane-3fba9236-dxptf WARNING: This action will delete both local reference and remote instance. Error removing host "rke2-rc-control-plane-3fba9236-dxptf": the server has asked for the client to provide credentials (get virtualmachines.kubevirt.io rke2-rc-control-plane-3fba9236-dxptf)
Hi, thanks for this bug report. May I ask which Harvester versions you were using, @bpedersen2, @sarahhenkens and when you last updated them?
I am on Harvester 1.2.1 and Rancher 2.8.3 (and waiting for 1.2.2 to be able to upgrade to 1.3.x eventually).
Ran into the same issue today as @dawid10353. I'm running Harvester 1.2.1 and Rancher 2.8.2.
Could you please check the expiry of your API access tokens? There needs to be a kubeconfig token that isn't expired and is associated with the Harvester cluster.
It would also be helpful to know what you were trying to do when you observed the problems and at which step in the process the problems started to occur.
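A quick way to check the token expiry, assuming kubectl access to the Rancher local cluster (the field names are taken from the management.cattle.io Token type; adjust if your version differs):

```bash
# List Rancher API tokens with their owner and expiry:
kubectl get tokens.management.cattle.io \
  -o custom-columns='NAME:.metadata.name,USER:.userId,EXPIRED:.expired,EXPIRES-AT:.expiresAt'
```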
I've managed to reproduce the issue in a test environment:
- Install Harvester (tested with v1.2.1, likely irrelevant)
- Install Rancher (tested with v2.8.2, testing with other versions TBD)
- Import Harvester cluster
- Upload a suitable cloud image and create a VM network, so VMs can be created
- To shorten the time to reproduce, set the default token TTL in Rancher to e.g. 10 minutes. This is a global config setting in Rancher (see the sketch after this list).
- Create a Cloud Credential for the Harvester cluster
- Create a K8s cluster with the Harvester cluster as infrastructure provider, using the previously created Cloud Credential for authentication
- Wait until the default token TTL has elapsed. The token associated with the Cloud Credentials will expire and eventually be removed, but the Cloud Credential itself will remain. This will not cause an error just yet, though.
- Scale the K8s cluster created above up or down. This operation will fail with behavior and errors similar to the reported problem.
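For the TTL step above, a sketch of shortening the default kubeconfig token TTL via kubectl (the setting can also be changed in the Rancher UI under Global Settings; the value is in minutes):

```bash
# Shrink the default kubeconfig token TTL to 10 minutes so the expiry
# can be observed quickly in a test environment:
kubectl patch settings.management.cattle.io kubeconfig-default-token-ttl-minutes \
  --type merge -p '{"value":"10"}'
```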
I'm not sure if and how OIDC interacts here, since it wasn't in my test environment. Since the original bug report does not include any mention of an external identity provider and it makes my test environment simpler, I'll focus on locally provided users.
As a workaround, I suggest creating Cloud Credentials associated with a token without an expiration date. To do that, set the maximum token TTL and the default token TTL settings (both global settings in Rancher) to 0 (see the sketch below). Then create the Cloud Credentials to be used to create a K8s cluster on Harvester, and create a K8s cluster using these Cloud Credentials.
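A sketch of that, assuming the v2.8 setting names (and note the security trade-off discussed further down in this thread):

```bash
# Allow newly minted tokens to live forever (0 = no expiry):
kubectl patch settings.management.cattle.io auth-token-max-ttl-minutes \
  --type merge -p '{"value":"0"}'
kubectl patch settings.management.cattle.io kubeconfig-default-token-ttl-minutes \
  --type merge -p '{"value":"0"}'
```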
To recover an existing cluster, adjust the maximum token TTL and default token TTL to 0, create a new Cloud Credential for the Harvester cluster, and edit the YAML for the cluster such that .spec.cloudCredentialSecretName points to the new Cloud Credentials (a sketch follows below).
The K8s cluster will eventually recover and any outstanding scaling operation will be completed. The old Cloud Credentials can be disposed of afterwards.
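A sketch of re-pointing an affected cluster at the new Cloud Credential, with hypothetical names (my-cluster, cc-new123); the reference uses the <namespace>:<secret-name> form:

```bash
# Point the provisioning cluster at the newly created Cloud Credential:
kubectl -n fleet-default patch clusters.provisioning.cattle.io my-cluster \
  --type merge \
  -p '{"spec":{"cloudCredentialSecretName":"cattle-global-data:cc-new123"}}'
```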
I was able to replicate the issue by setting the token expiry to 10 minutes as shared by @m-ildefons. The Rancher deployed is v2.8.3, and the Harvester version is v1.2.1. I noticed that when the token associated with the downstream cluster expires, the connection between Rancher and Harvester is disrupted. Well, this token expiry value is set to 30 days by default, as documented here. This change has been introduced in Rancher 2.8, as mentioned in this PR.
I guess this issue has not been observed in earlier versions of Rancher, as the token value was set to infinite, as documented here. It appears that it has been implemented for security reasons, performance enhancements, and to manage too many unexpired tokens.
The original issue https://github.com/rancher/rancher/issues/41919 introduces the default TTL value (30 days for kubeconfig tokens) to securely manage tokens for users (or headless users for programmatic access).
However, Harvester cloud credentials, which are used for authenticating and authorizing the Rancher cluster to manage downstream Harvester clusters, should not use the unified TTL applied in this case, as the token is not for a user but for an internal mechanism.
cc @ibrokethecloud @bk201 @Vicente-Cheng @m-ildefons
Same issue here - what is the workaround?
Hi @irony, We've published workaround instructions in our knowledge-base here: https://harvesterhci.io/kb/renew_harvester_cloud_credentials
It's inadvisable to set the default- or maximum-token-ttl settings to 0, because this would weaken security by allowing API tokens to remain valid forever. Hence the workaround is to renew the cloud credentials (and with the cloud credentials the token as well) periodically until a permanent fix has been implemented. By default the tokens expire after 30 days.
@m-ildefons thank you for adding the documentation for the work-around!
- Generate a new Rancher authentication token with the value of Scope set to No Scope. You can customize the TTL for the token (for example, a value of 0 results in tokens that do not expire).
It's unclear where we should do this. Is this in the Rancher instance from Harvester itself?
I wonder if this bug is related: https://github.com/rancher/rancher/issues/45449
My downstream clusters always become really unstable after 30 days and generally just break, forcing me to restore the entire cluster from backups. Rebooting or touching any control plane node is always a gamble as to whether it will recover or not.
Does the control plane node get deleted?
Hi @sarahhenkens ,
thanks for drawing attention to https://github.com/rancher/rancher/issues/45449, this issue looks to me like it's the same as this one.
It's unclear where we should do this. Is this in the Rancher instance from Harvester itself?
No, this is supposed to be done in the Rancher Manager instance that is used to manage the downstream clusters. It may be hosted on the Harvester cluster itself using a VM or the vcluster plugin, but it can also be completely separate from the Harvester cluster.
Let me draw a diagram to give you a clearer picture:
┌────────────────────────────────────────────────────────────────────────┐
│Harvester Cluster │
│ │
│ ┌─────────────────────────────────────────────────────┤
┌─────────►│ │Workloads │
│access to │ │ │
│ ┌──────►│ │ │
│ │ │ │ │
│ │ │ │ │
│ │ │ │ │
│ │ │ │ ┌───────────────┐ ┌───────────────┐ ┌───────────────┤
│ │ │ │ │Guest Cluster 1│ │Guest Cluster 2│ │Guest Cluster 3│
│ │ │ │ │ │ │ │ │ │
│ │ ├────────────────┐ │ │ │ │ │ │ │
│ │ │Integrated │ │ │ │ │ │ │ │
│ │ │Rancher │ │ │ │ │ │ │ │
│ │ │ │ │ │ │ │ │ │ │
│ │ │ │ │ │ │ │ │ │ │
│ │ │ │ │ │ │ │ │ │ │
│ │ │ │ │ │ │ │ │ │ │
│ │ │ │ │ │ │ │ │ │ │
│ │ │ │ │ │ │ │ │ │ │
│ │ │ │ │ │ ▲ │ │ ▲ │ │ ▲ │
│ │ │ │ │ │ │ │ │ │ │ │ │ │
│ │ └────────────────┴─┴─┴──────┼────────┴─┴──────┼────────┴─┴───────┼───────┘
│ │ │ │ │
│ │ │ │ │
│ │ └─────────────────┤ │
│ │ │ │
│ │ ┌─────────────────────────────────────────┐ │ │
│ │ │Rancher Manager (MCM) │ │ │
│ │ │ │ │ │
│ │ │ ┌──────────────────────┐ (used for) │ │ │
│ └───────┼─┤Cloud Creds 1 ├────────────────┼───┘ │
│ │ ├──────────────────────┤ │ │
│ │ │Token 1 │ │ │
│ │ └──────────────────────┘ │ │
│ │ │ │
│ │ ┌──────────────────────┐ │ (used for) │
└──────────┼─┤Cloud Creds 2 ├────────────────┼──────────────────────┘
│ ├──────────────────────┤ │
│ │Token 2 │ │
│ └──────────────────────┘ │
│ │
└─────────────────────────────────────────┘
You haven't provided exact details of your environment, but I assume it looks similar to the above.
For your Harvester cluster, you'll have one or more cloud credentials in your Rancher Manager (MCM) instance. Each of the Cloud Credential objects is associated with a token, provides access to the Harvester cluster and may be used for managing one or more downstream clusters. If the token expires, the Rancher Manager instance loses its ability to use that Cloud Credential object to authenticate actions for managing the associated downstream cluster's VMs. Any scaling or management operation that interacts with the VMs of that downstream cluster will fail from then on, until the token is replaced for the Cloud Credential object.
By default the tokens of the Cloud Credentials expire 30 days after the Cloud Credential object has been created, which is why your downstream clusters become fragile after 30 days.
You can change the default value for the expiration using the setting kubeconfig-default-token-ttl-minutes in the settings of the Rancher Manager (MCM) instance, but keep the security implications in mind (a quick way to inspect it is sketched below).
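A sketch, assuming kubectl access to the Rancher local cluster:

```bash
# Inspect the current default kubeconfig token TTL (in minutes) on the
# Rancher Manager (MCM) local cluster:
kubectl get settings.management.cattle.io kubeconfig-default-token-ttl-minutes
```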
To avoid weakening security, the tokens have to be rotated on a regular basis. The KB article describes how to do that. The workaround can also be applied to recover downstream clusters that are already affected by expired tokens.
There is indeed an integrated Rancher instance in Harvester, but that shouldn't be the one used to manage downstream clusters.
I hope this clears up the picture.
Harvester would like this in for their 1.4.0 version, which lines up with the 2.9.x milestone before October, so this would need a backport.
Also 2.8.x.
One comment about the KB: the patch script has a variable named CLOUD_CREDENTIAL_NAME, which implies that the name of the credential needs to be supplied, but in the Rancher UI the Name column is the user-friendly one:
And looking at the list in k9s: Name is the value called ID in Rancher's UI, and it is the value needed for the script.
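A quick way to map the two, assuming the display name is kept in the field.cattle.io/name annotation (an assumption based on how Rancher stores cloud credential secrets):

```bash
# NAME is the secret name ("ID" in the UI, what the script needs);
# DISPLAY is the user-friendly name shown in the Name column:
kubectl -n cattle-global-data get secrets \
  -o custom-columns='NAME:.metadata.name,DISPLAY:.metadata.annotations.field\.cattle\.io/name'
```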
It would probably be best to update the KB with this.
I could do that myself, but clicking the Edit this page link gives me a 404.