[BUG] Rancher can no longer provision harvester machines after restart

Open sarahhenkens opened this issue 11 months ago • 21 comments

Rancher Server Setup

  • Rancher version: v2.8.0
  • Installation option (Docker install/Helm Chart): as a helm chart on a single-node k3s cluster
  • Proxy/Cert Details:

Information about the Cluster

  • Infrastructure Provider = Harvester

User Information

  • What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom)
    • Admin

Describe the bug

After one of my Harvester nodes was unexpectedly rebooted, Rancher is no longer able to provision machines in the upstream Harvester HCI infrastructure.

Trying to scale up an existing managed RKE2 cluster from Rancher produces the following error:

 machine Downloading driver from https://192.168.20.10/assets/docker-machine-driver-harvester
 machine Doing /etc/rancher/ssl
 machine docker-machine-driver-harvester
 machine docker-machine-driver-harvester: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, stripped
 machine Trying to access option  which does not exist
 machine THIS ***WILL*** CAUSE UNEXPECTED BEHAVIOR
 machine Type assertion did not go smoothly to string for key
 machine Running pre-create checks...
 machine Error with pre-create check: "the server has asked for the client to provide credentials (get settings.harvesterhci.io server-version)"
 machine The default lines below are for a sh/bash shell, you can specify the shell you're using, with the --shell flag.

And creating a brand-new cluster fails with a different error:

 machine Downloading driver from https://192.168.20.10/assets/docker-machine-driver-harvester
 machine Doing /etc/rancher/ssl
 machine docker-machine-driver-harvester
 machine docker-machine-driver-harvester: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, stripped
 machine error loading host testing-pool1-31b05da3-dlchl: Docker machine "testing-pool1-31b05da3-dlchl" does not exist. Use "docker-machine ls" to list machines. Use "docker-machine create" to add a new one.

Looks like the connection between Rancher and Harvester is broken?

sarahhenkens avatar Mar 24 '24 18:03 sarahhenkens

Maybe related to #44929?

bpedersen2 avatar Mar 27 '24 12:03 bpedersen2

Seems to occur even after the fix for #44929, both on scaling and on creating a new cluster.

bpedersen2 avatar Mar 27 '24 17:03 bpedersen2

And I am on Rancher v2.8.2.

bpedersen2 avatar Mar 27 '24 17:03 bpedersen2

Looking at the created job (for a worker node scaleup):

"args": [ 8 items
"--driver-download-url=https://<host>/assets/docker-machine-driver-harvester",
"--driver-hash=a9c2847eff3234df6262973cf611a91c3926f3e558118fcd3f4197172eda3434",
"--secret-namespace=fleet-default",
"--secret-name=staging-pool-worker-bbfc2798-d5jsj-machine-state",
"rm",
"-y",
"--update-config",
"staging-pool-worker-bbfc2798-d5jsj"

The first thing the driver tries is to delete the non-existing machine, and that fails... I would expect a create instead. I just don't know where this command is generated.

bpedersen2 avatar Mar 28 '24 08:03 bpedersen2

I could manually fix it:

  1. go to the harvester embedded rancher and get the kube config
  2. update the kubeconfig in the Harvester credential in the cattle-global-data namespace in the local cluster (running Rancher); the secrets are probably named hv-cred
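
For reference, a rough kubectl version of step 2, run against the local cluster where Rancher is installed. This is only a sketch: the secret name (hv-cred) and the data key (harvestercredentialConfig-kubeconfigContent) are assumptions based on how the Harvester node driver stores its credential fields, so list the secrets in cattle-global-data first and adjust.

  # Find the right cloud credential secret (often named cc-xxxxx, display name may differ).
  kubectl -n cattle-global-data get secrets

  # Replace the stored kubeconfig with the freshly downloaded one
  # (assumed to be saved as ./harvester-kubeconfig.yaml in the current directory).
  # The data key is assumed; verify it with `kubectl -n cattle-global-data get secret hv-cred -o yaml`.
  kubectl -n cattle-global-data patch secret hv-cred --type merge \
    -p "{\"data\":{\"harvestercredentialConfig-kubeconfigContent\":\"$(base64 -w0 harvester-kubeconfig.yaml)\"}}"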

bpedersen2 avatar Mar 28 '24 14:03 bpedersen2

@bpedersen2 do you have rancher running inside a nested VM or in the same kubernetes cluster of Harvester itself?

sarahhenkens avatar Mar 29 '24 19:03 sarahhenkens

Following the manual fix steps by getting the kubeconfig and manually updating the secret in Rancher worked for me!

sarahhenkens avatar Mar 29 '24 19:03 sarahhenkens

@bpedersen2 do you have rancher running inside a nested VM or in the same kubernetes cluster of Harvester itself?

No, it is running standalone.

bpedersen2 avatar Apr 02 '24 06:04 bpedersen2

What I observe is that the token in Harvester changes.

Rancher is configured to use OIDC, and in the Rancher logs I get:

Error refreshing token principals, skipping: oauth2: "invalid_grant" "Token is not active"
2024/04/02 11:43:26 [ERROR] [keycloak oidc] GetPrincipal: error creating new http client: oauth2: "invalid_grant" "Token is not active"
2024/04/02 11:43:26 [ERROR] error syncing 'user-XXX': handler mgmt-auth-userattributes-controller: oauth2: "invalid_grant" "Token is not active", requeuing

With a local user, it seems to work

bpedersen2 avatar Apr 02 '24 11:04 bpedersen2

I re-registered the Harvester cluster using a non-OIDC admin account, and now the connection seems to be stable again. It looks like a problem with token expiration to me.

bpedersen2 avatar Apr 03 '24 06:04 bpedersen2

I have the same problem:

Failed creating server [fleet-default/rke2-rc-control-plane-2aae5bdf-2m48z] of kind (HarvesterMachine) for machine rke2-rc-control-plane-5b74797746x4dpcs-ncdxf in infrastructure provider: CreateError: Downloading driver from https://HOST/assets/docker-machine-driver-harvester Doing /etc/rancher/ssl docker-machine-driver-harvester docker-machine-driver-harvester: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, stripped Trying to access option which does not exist THIS ***WILL*** CAUSE UNEXPECTED BEHAVIOR Type assertion did not go smoothly to string for key Running pre-create checks... Error with pre-create check: "the server has asked for the client to provide credentials (get settings.harvesterhci.io server-version)" The default lines below are for a sh/bash shell, you can specify the shell you're using, with the --shell flag.

  • Rancher: v2.8.2
  • Dashboard: v2.8.0
  • Helm: v2.16.8-rancher2
  • Machine: v0.15.0-rancher106
  • Harvester: v1.2.1

dawid10353 avatar Apr 05 '24 04:04 dawid10353

I have been in a loop for many hours: a new VM is created, an error occurs, the VM is deleted, then a new VM is created again, errors again, and is deleted again...

dawid10353 avatar Apr 05 '24 04:04 dawid10353

I could manually fix it:

  1. go to the harvester embedded rancher and get the kube config
  2. update the kubeconfig in the Harvester credential in the cattle-global-data namespace in the local cluster (running Rancher); the secrets are probably named hv-cred

OK, that worked for me. I have Rancher with users provided by Active Directory.

dawid10353 avatar Apr 05 '24 04:04 dawid10353

Now I have this error:

	Failed deleting server [fleet-default/rke2-rc-control-plane-3fba9236-dxptf] of kind (HarvesterMachine) for machine rke2-rc-control-plane-77f9455c9dx9xgsk-4kcwf in infrastructure provider: DeleteError: Downloading driver from https://HOST/assets/docker-machine-driver-harvester Doing /etc/rancher/ssl docker-machine-driver-harvester docker-machine-driver-harvester: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, stripped About to remove rke2-rc-control-plane-3fba9236-dxptf WARNING: This action will delete both local reference and remote instance. Error removing host "rke2-rc-control-plane-3fba9236-dxptf": the server has asked for the client to provide credentials (get virtualmachines.kubevirt.io rke2-rc-control-plane-3fba9236-dxptf)

dawid10353 avatar Apr 05 '24 05:04 dawid10353

Hi, thanks for this bug report. May I ask which Harvester versions you were using, @bpedersen2, @sarahhenkens and when you last updated them?

m-ildefons avatar May 07 '24 14:05 m-ildefons

I am on Harvester 1.2.1 and Rancher 2.8.3 (and waiting for 1.2.2 to be able to upgrade to 1.3.x eventually).

bpedersen2 avatar May 07 '24 14:05 bpedersen2

Ran into the same issue today as @dawid10353. I'm running Harvester 1.2.1 and Rancher 2.8.2.

sarahhenkens avatar May 11 '24 17:05 sarahhenkens

Could you please check the expiry of your API access tokens? There needs to be a kubeconfig token that isn't expired and is associated with the Harvester cluster.
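
If you prefer the CLI over the UI, a quick sketch against the local (Rancher) cluster that lists the tokens and their expiry; the field names (.userId, .expiresAt) are assumed from the management.cattle.io/v3 Token type, so double-check with -o yaml on one token first.

  # Show Rancher API tokens with their owner and expiry timestamp.
  kubectl get tokens.management.cattle.io \
    -o custom-columns=NAME:.metadata.name,USER:.userId,EXPIRES:.expiresAt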

It would also be helpful to know what you were trying to do when you observed the problems and at which step in the process the problems started to occur.

m-ildefons avatar May 13 '24 07:05 m-ildefons

I've managed to reproduce the issue in a test environment:

  1. Install Harvester (tested with v1.2.1, likely irrelevant)
  2. Install Rancher (tested with v2.8.2, testing with other versions TBD)
  3. Import Harvester cluster
  4. Upload a suitable cloud image and create a VM network, so VMs can be created
  5. To shorten the time to reproduce, set the default token TTL in Rancher to e.g. 10 minutes. This is a global config setting in Rancher (see the sketch after this list).
  6. Create a Cloud Credential for the Harvester cluster
  7. Create a K8s cluster with the Harvester cluster as infrastructure provider, using the previously created Cloud Credential for authentication
  8. Wait until the default token TTL is over. The token associated with the Cloud Credentials will be expired and eventually removed, but the Cloud Credential will remain. This will not cause an error just yet though.
  9. Scale the K8s cluster from step 8 up or down. This operation will fail with behavior and errors similar to the reported problem.
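
For step 5, a minimal sketch of shortening the TTL with kubectl; the setting name kubeconfig-default-token-ttl-minutes is the one discussed later in this thread and the value is in minutes, but verify which setting your Rancher version actually honours.

  # Shorten the kubeconfig/cloud-credential token TTL to 10 minutes to speed up reproduction.
  kubectl patch settings.management.cattle.io kubeconfig-default-token-ttl-minutes \
    --type merge -p '{"value":"10"}'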

I'm not sure if and how OIDC interacts here, since it wasn't in my test environment. Since the original bug report does not include any mention of an external identity provider and it makes my test environment simpler, I'll focus on locally provided users.

As a workaround, I suggest creating Cloud Credentials associated with a token without an expiration date. To do that, set the maximum token TTL and the default token TTL (both global settings in Rancher) to 0. Then create the Cloud Credentials to be used to create a K8s cluster on Harvester, and then create a K8s cluster using these Cloud Credentials.

To recover an existing cluster, adjust the maximum token TTL and default token TTL to 0, create a new Cloud Credential for the Harvester cluster, and edit the YAML for the cluster such that .spec.cloudCredentialSecretName points to the new Cloud Credential. The K8s cluster will eventually recover and any outstanding scaling operation will be completed. The old Cloud Credential can be disposed of afterwards.
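
A rough kubectl equivalent of that YAML edit, assuming the guest cluster is called my-cluster and the new Cloud Credential secret is cc-new123; the cattle-global-data: prefix is assumed to mirror the format Rancher writes into that field, so copy the exact value from the new credential.

  # Point the provisioning cluster object at the newly created cloud credential.
  kubectl -n fleet-default patch clusters.provisioning.cattle.io my-cluster \
    --type merge \
    -p '{"spec":{"cloudCredentialSecretName":"cattle-global-data:cc-new123"}}'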

m-ildefons avatar May 14 '24 10:05 m-ildefons

I was able to replicate the issue by setting the token expiry to 10 minutes, as shared by @m-ildefons. The Rancher deployed is v2.8.3, and the Harvester version is v1.2.1. I noticed that when the token associated with the downstream cluster expires, the connection between Rancher and Harvester is disrupted. This token expiry value is set to 30 days by default, as documented here. The change was introduced in Rancher 2.8, as mentioned in this PR.

I guess this issue has not been observed in earlier versions of Rancher, as the token TTL was set to infinite, as documented here. It appears that the change was made for security reasons, performance enhancements, and to avoid accumulating too many unexpired tokens.
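
To check what an installation currently uses, something like the following against the local cluster should work; the setting name is taken from the later discussion in this thread, and .value/.default are the fields on the Rancher Setting object.

  # Show the active and default kubeconfig token TTL, in minutes (an empty VALUE means the default applies).
  kubectl get settings.management.cattle.io kubeconfig-default-token-ttl-minutes \
    -o custom-columns=NAME:.metadata.name,VALUE:.value,DEFAULT:.default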

khushalchandak17 avatar May 15 '24 20:05 khushalchandak17

The original issue https://github.com/rancher/rancher/issues/41919 introduced the default TTL value (30 minutes for kubeconfig) to securely manage tokens for users (or headless users for programmatic purposes).

However, Harvester cloud credentials, which are used for authenticating and authorizing the Rancher cluster to manage downstream Harvester clusters, should not use the unified TTL applied in this case, as the token is not a user token but an internal mechanism.

cc @ibrokethecloud @bk201 @Vicente-Cheng @m-ildefons

innobead avatar May 17 '24 07:05 innobead

Same issue here. What is the workaround?

irony avatar May 31 '24 05:05 irony

Hi @irony, we've published workaround instructions in our knowledge base here: https://harvesterhci.io/kb/renew_harvester_cloud_credentials

It's inadvisable to set the default- or maximum-token-ttl settings to 0, because this would weaken security by allowing API tokens to remain valid forever. Hence the workaround is to renew the cloud credentials (and with the cloud credentials the token as well) periodically until a permanent fix has been implemented. By default the tokens expire after 30 days.

m-ildefons avatar May 31 '24 07:05 m-ildefons

@m-ildefons thank you for adding the documentation for the work-around!

  1. Generate a new Rancher authentication token with the value of Scope set to No Scope. You can customize the TTL for the token (for example, a value of 0 results in tokens that do not expire).

It's unclear where we should do this. Is this in the Rancher instance embedded in Harvester itself?

sarahhenkens avatar Jun 02 '24 04:06 sarahhenkens

I wonder if this bug is related https://github.com/rancher/rancher/issues/45449

My downstream clusters always become really unstable after 30 days and generally just break, forcing me to restore the entire cluster from backups. Rebooting or touching any control plane node is always a gamble as to whether it will recover or not.

sarahhenkens avatar Jun 02 '24 04:06 sarahhenkens

Does the control plane node get deleted?

ibrokethecloud avatar Jun 02 '24 23:06 ibrokethecloud

Hi @sarahhenkens ,

Thanks for drawing attention to https://github.com/rancher/rancher/issues/45449; it looks to me like the same issue as this one.

It's unclear where we should do this. Is this in the Rancher instance embedded in Harvester itself?

No, this is supposed to be done in the Rancher Manager instance that is used to manage the downstream clusters. It may be hosted on the Harvester cluster itself using a VM or the vcluster plugin, but it can also be completely separate from the Harvester cluster.

Let me draw a diagram to give you a clearer picture:

           ┌────────────────────────────────────────────────────────────────────────┐
           │Harvester Cluster                                                       │
           │                                                                        │
           │                  ┌─────────────────────────────────────────────────────┤
┌─────────►│                  │Workloads                                            │
│access to │                  │                                                     │
│  ┌──────►│                  │                                                     │
│  │       │                  │                                                     │
│  │       │                  │                                                     │
│  │       │                  │                                                     │
│  │       │                  │ ┌───────────────┐ ┌───────────────┐ ┌───────────────┤
│  │       │                  │ │Guest Cluster 1│ │Guest Cluster 2│ │Guest Cluster 3│
│  │       │                  │ │               │ │               │ │               │
│  │       ├────────────────┐ │ │               │ │               │ │               │
│  │       │Integrated      │ │ │               │ │               │ │               │
│  │       │Rancher         │ │ │               │ │               │ │               │
│  │       │                │ │ │               │ │               │ │               │
│  │       │                │ │ │               │ │               │ │               │
│  │       │                │ │ │               │ │               │ │               │
│  │       │                │ │ │               │ │               │ │               │
│  │       │                │ │ │               │ │               │ │               │
│  │       │                │ │ │               │ │               │ │               │
│  │       │                │ │ │      ▲        │ │      ▲        │ │       ▲       │
│  │       │                │ │ │      │        │ │      │        │ │       │       │
│  │       └────────────────┴─┴─┴──────┼────────┴─┴──────┼────────┴─┴───────┼───────┘
│  │                                   │                 │                  │        
│  │                                   │                 │                  │        
│  │                                   └─────────────────┤                  │        
│  │                                                     │                  │        
│  │       ┌─────────────────────────────────────────┐   │                  │        
│  │       │Rancher Manager (MCM)                    │   │                  │        
│  │       │                                         │   │                  │        
│  │       │ ┌──────────────────────┐ (used for)     │   │                  │        
│  └───────┼─┤Cloud Creds 1         ├────────────────┼───┘                  │        
│          │ ├──────────────────────┤                │                      │        
│          │ │Token 1               │                │                      │        
│          │ └──────────────────────┘                │                      │        
│          │                                         │                      │        
│          │ ┌──────────────────────┐                │  (used for)          │        
└──────────┼─┤Cloud Creds 2         ├────────────────┼──────────────────────┘        
           │ ├──────────────────────┤                │                               
           │ │Token 2               │                │                               
           │ └──────────────────────┘                │                               
           │                                         │                               
           └─────────────────────────────────────────┘                               

You haven't provided exact details of your environment, but I assume it looks similar to the above.

For your Harvester cluster, you'll have one or more cloud credentials in your Rancher Manager (MCM) instance. Each of the Cloud Credential objects is associated with a token, provides access to the Harvester cluster and may be used for managing one or more downstream clusters. If the token expires, the Rancher Manager instance loses its ability to use that Cloud Credential object to authenticate actions for managing the associated downstream cluster's VMs. Any scaling or management operation that interacts with the VMs of that downstream cluster will fail from then on, until the token is replaced for the Cloud Credential object.

By default the tokens of the Cloud Credentials expire 30 days after the Cloud Credential object has been created, which is why your downstream clusters become fragile after 30 days. You can change the default values for the expiration using the setting kubeconfig-default-token-ttl-minutes in the settings of the Rancher Manager (MCM) instance, but keep the security implications in mind. To avoid weakening security, the tokens have to be rotated on a regular basis. The KB article describes how to do that. The workaround can also be applied to recover downstream clusters that are already affected by expired tokens.
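
To see at a glance which Cloud Credential each downstream cluster is bound to, a sketch like the following, run against the Rancher Manager cluster, can help; the field path is the same .spec.cloudCredentialSecretName mentioned above, assumed to live on the provisioning.cattle.io/v1 Cluster objects in fleet-default.

  # Map each provisioning cluster to the cloud credential secret it uses.
  kubectl -n fleet-default get clusters.provisioning.cattle.io \
    -o custom-columns=CLUSTER:.metadata.name,CREDENTIAL:.spec.cloudCredentialSecretName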

There is indeed an integrated Rancher instance in Harvester, but that shouldn't be the one used to manage downstream clusters.

I hope this clears up the picture.

m-ildefons avatar Jun 03 '24 08:06 m-ildefons

Harvester would like this in their 1.4.0 release, which lines up with the 2.9.x milestone before October, so this would need a backport.

gaktive avatar Jul 11 '24 23:07 gaktive

Also 2.8.x.

gaktive avatar Jul 12 '24 00:07 gaktive

One comment about the KB: the patch script has a variable named CLOUD_CREDENTIAL_NAME, which implies that the name of the credential needs to be supplied, but in the Rancher UI the Name column shows the user-friendly name.

Looking at the list in k9s, the secret Name is the value that Rancher's UI calls ID, and that is the value the script needs.
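
One way to get the right value without k9s; this assumes the user-friendly name is stored in the field.cattle.io/name annotation on the secret, so verify on your system (fall back to -o yaml if the escaped annotation path gives your kubectl trouble).

  # Print the secret name (the "ID" in the Rancher UI, needed by the script) next to the display name.
  kubectl -n cattle-global-data get secrets \
    -o custom-columns='ID:.metadata.name,NAME:.metadata.annotations.field\.cattle\.io/name'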

It would probably be best to update the KB with this. I could do that myself, but clicking the Edit this page link gives me a 404.

DovydasNavickas avatar Jul 22 '24 23:07 DovydasNavickas