terraform-provider-rancher2
Intermittently imports of EKS clusters never finish
Versions
- Rancher version: 2.6.8
- Rancher Terraform provider: 1.24.0
- Terraform: 1.2.2
Information about the Cluster
- Kubernetes version: 1.21
- Cluster Type (Local/Downstream): Downstream
- If downstream, what type of cluster? (Custom/Imported or specify provider for Hosted/Infrastructure Provider): Hosted EKS
Describe the bug
Sometimes importing an EKS cluster never completes (it reports "Still creating..." for 30 minutes and then times out), even though the cluster is active in the Rancher instance. Other times it finishes in seconds.
To Reproduce
Using this code to import the cluster. The aws-auth ConfigMap has already been updated with the user referred to by the cloud credential.
```hcl
resource "rancher2_cloud_credential" "this" {
  name        = var.name_prefix
  description = "Credentials used for managing ${var.name_prefix}"

  amazonec2_credential_config {
    access_key = aws_iam_access_key.rancher.id
    secret_key = aws_iam_access_key.rancher.secret
  }
}

resource "rancher2_cluster" "imported_eks_cluster" {
  name        = var.cluster_id
  description = "Terraform EKS cluster"

  eks_config_v2 {
    cloud_credential_id = rancher2_cloud_credential.this.id
    name                = var.cluster_id
    region              = var.region
    imported            = true
  }
}
```
Result
Sometimes this repeats until the timeout, even though the cluster is active in Rancher:
```
module.import_to_rancher[0].rancher2_cluster.imported_eks_cluster: Still creating... [10m40s elapsed]
module.import_to_rancher[0].rancher2_cluster.imported_eks_cluster: Still creating... [10m50s elapsed]
module.import_to_rancher[0].rancher2_cluster.imported_eks_cluster: Still creating... [11m0s elapsed]
module.import_to_rancher[0].rancher2_cluster.imported_eks_cluster: Still creating... [11m10s elapsed]
...
Error: [ERROR] waiting for cluster (c-xfbkg) to be created: timeout while waiting for state to become 'pending' (last state: 'active', timeout: 30m0s)
│
│ with module.import_to_rancher[0].rancher2_cluster.imported_eks_cluster,
│ on .terraform/modules/import_to_rancher/main.tf line 27, in resource "rancher2_cluster" "imported_eks_cluster":
│ 27: resource "rancher2_cluster" "imported_eks_cluster" {
```
Expected Result
The cluster is consistently imported in a few seconds.
Additional context
Hi,
Same here, same error. Importing a new EKS cluster v1.23.10-eks-15b7512 via Terraform.
- Rancher 2.6.8
- Terraform cli 1.3.2
- rancher2 provider v1.24.1
- Local Rancher cluster: v1.24.4+k3s1
- While Terraform is waiting ("Still creating... [XmXs elapsed]"), the cluster is imported successfully in the Rancher console and you can even manage it. Somehow the provider is not aware that the cluster import is complete.
Fixing:
- Only destroying and applying TF again fixes the issue, and the import is successful.
Still experiencing this. Anyone?
[SURE-5616]
I have been seeing this as well, on successful runs it takes seconds, but occasionally this hangs.
I have the same problem also with Rancher 2.7.1 and K8s 1.24
Ran into this today. From provider config: https://github.com/rancher/terraform-provider-rancher2/blob/master/rancher2/resource_rancher2_cluster.go#L135
```go
expectedState := "active"
if cluster.Driver == clusterDriverImported || (cluster.Driver == clusterDriverEKSV2 && cluster.EKSConfig.Imported) {
	expectedState = "pending"
}
```
It appears the provider expects the state to become "pending" first. However, if the Rancher side is faster than the provider's polling loop, the cluster may become "active" so quickly that the provider misses the "pending" state. From my limited understanding of Go, it would actually be possible to wait for multiple targets in
```go
stateConf := &resource.StateChangeConf{
	Pending:    []string{},
	Target:     []string{expectedState},
	Refresh:    clusterStateRefreshFunc(client, newCluster.ID),
	Timeout:    d.Timeout(schema.TimeoutCreate),
	Delay:      1 * time.Second,
	MinTimeout: 3 * time.Second,
}
_, waitErr := stateConf.WaitForState()
if waitErr != nil {
	return fmt.Errorf("[ERROR] waiting for cluster (%s) to be created: %s", newCluster.ID, waitErr)
}
```
If, for EKS, testing against both the "pending" and "active" targets were allowed, this could probably be fixed?
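The race described above can be sketched without the plugin SDK at all. In this minimal stand-in, `waitForState` is a hypothetical polling helper (not the SDK's `StateChangeConf`), and the simulated cluster jumps from "provisioning" straight to "active" between polls, so "pending" is never observed:

```go
package main

import "fmt"

// waitForState polls refresh until the state matches one of targets,
// giving up after maxPolls attempts. A simplified stand-in for the
// SDK's StateChangeConf wait loop.
func waitForState(refresh func() string, targets []string, maxPolls int) (string, error) {
	for i := 0; i < maxPolls; i++ {
		state := refresh()
		for _, t := range targets {
			if state == t {
				return state, nil
			}
		}
	}
	return "", fmt.Errorf("timeout waiting for state to become one of %v", targets)
}

func main() {
	// The simulated cluster skips "pending" entirely: it flips from
	// "provisioning" straight to "active" between two polls.
	simulate := func() func() string {
		states := []string{"provisioning", "active", "active"}
		i := 0
		return func() string {
			s := states[i]
			if i < len(states)-1 {
				i++
			}
			return s
		}
	}

	// Waiting only for "pending" never succeeds and times out:
	_, err := waitForState(simulate(), []string{"pending"}, 3)
	fmt.Println("pending only:", err)

	// Waiting for either "pending" or "active" catches the fast import:
	got, _ := waitForState(simulate(), []string{"pending", "active"}, 3)
	fmt.Println("pending or active:", got)
}
```

The multi-target call returns "active" while the single-target call times out, which mirrors the provider behavior reported in this issue.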
As a workaround for those who think they need to destroy their entire state to reimport: I was able to get away with just removing the rancher2_cluster resource via

```
terraform state rm rancher2_cluster.mycluster
```

and then importing it via

```
terraform import rancher2_cluster.mycluster c-abcd
```

That way I didn't need to destroy everything Terraform had managed to provision so far. It seemed to work.
Good catch @herrbpl. I thought it would be something like that considering it sometimes works.
Hey,
I have been testing this locally and was not able to reproduce it after trying and applying it multiple times (maybe I was lucky).
All tests that I did used the following versions:
Test 1:
- Rancher version: 2.6.8
- Rancher Terraform provider: 1.24.0
- Terraform: 1.2.2
- Kubernetes version: 1.22 (this was the oldest available version in EKS)
- Cluster Type (Local/Downstream): Downstream Hosted EKS
- Local Rancher cluster v1.24.4+k3s1
Test 2:
- Rancher version: 2.6.8
- Rancher Terraform provider: 1.24.1
- Terraform: 1.3.2
- Kubernetes version: 1.24
- Cluster Type (Local/Downstream): Downstream Hosted EKS
- Local Rancher cluster v1.24.4+k3s1
Also, I have submitted a PR that tries to fix this issue: https://github.com/rancher/terraform-provider-rancher2/pull/1114
Just tried again twice with these versions, and both times the terraform apply timed out after 30 minutes even though the cluster was live in Rancher after about 60 seconds.
- Rancher version: 2.7.1
- Rancher Terraform provider: 3.0.0
- Terraform: 1.3.3
- EKS downstream K8s: v1.23.17
- Rancher K8s: v1.23.16
We also sometimes encountered the problem mentioned at the beginning: the expectedState in Terraform ("pending") did not match the status of the Rancher import ("active").
Our previous workaround was to avoid the "active" state by, for example, setting up the authorisation or the network connection at a later time. In the end, however, this only resulted in the Rancher status being "waiting", which also did not match the "pending" expected by the Terraform provider.
In my opinion, "active" should definitely be included in the expectedStates. Whether "waiting" should be part of the expectedState is certainly a topic for discussion and depends on whether the status of a successful import or only the status of a successfully created import "resource" is to be checked here. The latter would also include "waiting", since as soon as all prerequisites have been met, the import continues and hopefully jumps to the "Active" state.
Currently our solution is to use the implemented fix from PR https://github.com/rancher/terraform-provider-rancher2/pull/1114 and we can confirm that it works fine.
Versions used:
- Rancher version: 2.7.4/2.6.10
- Rancher Terraform provider: 3.0.0
- Terraform: 1.4.5
- EKS downstream K8s: v1.24.x/v1.23.x
- Rancher K8s: v1.24.13/v1.23.17
https://github.com/rancher/terraform-provider-rancher2/pull/1114 got merged, which should fix this issue.
Will be tested via https://github.com/rancher/eks-operator/issues/84
@kkaempf @furkatgofurov7 Sounds great, thank you guys. For QA, post the test steps here just to be clear on how to verify the intermittent import issue is resolved.
For QA, post the test steps here just to be clear on how to verify the intermittent import issue is resolved.
@cpinjani - please link your testplan here once you start working on rancher/eks-operator#84
Test Results - https://github.com/rancher/eks-operator/issues/84#issuecomment-1636382046
QA validated