terraform-provider-rancher2 Intermittently imports of EKS clusters never finish

Versions

Rancher version: 2.6.8
Rancher Terraform provider: 1.24.0
Terraform: 1.2.2

Information about the Cluster

Kubernetes version: 1.21
Cluster Type (Local/Downstream): Downstream
- If downstream, what type of cluster? (Custom/Imported or specify provider for Hosted/Infrastructure Provider): Hosted EKS

Describe the bug

Sometimes importing an EKS cluster never completes (Terraform reports "Still creating..." for 30 minutes and then times out), even though the cluster is active in the Rancher instance. Other times it finishes in seconds.

To Reproduce

The cluster is imported using the code below. The aws-auth ConfigMap has already been updated with the user referred to by the cloud_credential.

resource "rancher2_cloud_credential" "this" {
  name        = var.name_prefix
  description = "Credentials used for managing ${var.name_prefix}"
  amazonec2_credential_config {
    access_key = aws_iam_access_key.rancher.id
    secret_key = aws_iam_access_key.rancher.secret
  }
}

resource "rancher2_cluster" "imported_eks_cluster" {
  name        = var.cluster_id
  description = "Terraform EKS cluster"
  eks_config_v2 {
    cloud_credential_id = rancher2_cloud_credential.this.id
    name                = var.cluster_id
    region              = var.region
    imported            = true
  }
}

Result

Sometimes the following repeats until the time-out, even though the cluster is active in Rancher:

module.import_to_rancher[0].rancher2_cluster.imported_eks_cluster: Still creating... [10m40s elapsed]
module.import_to_rancher[0].rancher2_cluster.imported_eks_cluster: Still creating... [10m50s elapsed]
module.import_to_rancher[0].rancher2_cluster.imported_eks_cluster: Still creating... [11m0s elapsed]
module.import_to_rancher[0].rancher2_cluster.imported_eks_cluster: Still creating... [11m10s elapsed]
...
Error: [ERROR] waiting for cluster (c-xfbkg) to be created: timeout while waiting for state to become 'pending' (last state: 'active', timeout: 30m0s)
│ 
│   with module.import_to_rancher[0].rancher2_cluster.imported_eks_cluster,
│   on .terraform/modules/import_to_rancher/main.tf line 27, in resource "rancher2_cluster" "imported_eks_cluster":
│   27: resource "rancher2_cluster" "imported_eks_cluster" {

Expected Result

The cluster is consistently imported in a few seconds.

Screenshots

Additional context

Oct 12 '22 14:10 tulanian

Hi,

Same here, same error. Importing a new EKS cluster v1.23.10-eks-15b7512 via Terraform.

Rancher 2.6.8
Terraform cli 1.3.2
rancher2 provider v1.24.1

Local Rancher cluster v1.24.4+k3s1

While TF is waiting (Still creating... [XmXs elapsed]), the cluster is imported successfully in the Rancher console and you can even manage it. Somehow the provider is not aware that the cluster import is ready.

Fixing:

Only destroying and applying with TF again fixes the issue; the import is then successful.

Oct 13 '22 06:10 ghost

Still experiencing this. Anyone?

Nov 02 '22 14:11 tulanian

[SURE-5616]

Nov 25 '22 21:11 tbernacchi

I have been seeing this as well. On successful runs it takes seconds, but occasionally it hangs.

Jan 31 '23 17:01 code-eg

I also have the same problem with Rancher 2.7.1 and K8s 1.24.

Feb 17 '23 09:02 duemila2

Ran into this today. From the provider source: https://github.com/rancher/terraform-provider-rancher2/blob/master/rancher2/resource_rancher2_cluster.go#L135

	expectedState := "active"

	if cluster.Driver == clusterDriverImported || (cluster.Driver == clusterDriverEKSV2 && cluster.EKSConfig.Imported) {
		expectedState = "pending"
	}

It appears the provider expects the state to become pending first. However, if the Rancher side is faster than the provider's polling loop, the cluster may become active so quickly that the provider never observes the pending state. From my limited understanding of Go, it should actually be possible to wait for multiple targets in

	stateConf := &resource.StateChangeConf{
		Pending:    []string{},
		Target:     []string{expectedState},
		Refresh:    clusterStateRefreshFunc(client, newCluster.ID),
		Timeout:    d.Timeout(schema.TimeoutCreate),
		Delay:      1 * time.Second,
		MinTimeout: 3 * time.Second,
	}
	_, waitErr := stateConf.WaitForState()
	if waitErr != nil {
		return fmt.Errorf("[ERROR] waiting for cluster (%s) to be created: %s", newCluster.ID, waitErr)
	}

If, for imported EKS clusters, the provider were allowed to test against both pending and active targets, this could probably be fixed?
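To illustrate the suspected race, here is a minimal, stdlib-only Go sketch. It is not the provider's code: `refreshFunc`, `waitForAnyState`, and `alreadyActive` are hypothetical names that only mimic the shape of a polling loop with a list of target states. It assumes the cluster is already active by the time of the first poll.

```go
package main

import (
	"fmt"
	"time"
)

// refreshFunc returns the current cluster state. Hypothetical stand-in
// for the provider's refresh function, used only for this sketch.
type refreshFunc func() (string, error)

// waitForAnyState polls refresh until the state matches one of targets
// or the timeout elapses. It returns the last state it observed.
func waitForAnyState(refresh refreshFunc, targets []string, interval, timeout time.Duration) (string, error) {
	deadline := time.Now().Add(timeout)
	for {
		state, err := refresh()
		if err != nil {
			return "", err
		}
		for _, target := range targets {
			if state == target {
				return state, nil
			}
		}
		if time.Now().After(deadline) {
			return state, fmt.Errorf("timeout while waiting for state to become one of %v (last state: '%s')", targets, state)
		}
		time.Sleep(interval)
	}
}

// alreadyActive simulates a cluster that reached "active" before the first poll.
func alreadyActive() (string, error) {
	return "active", nil
}

func main() {
	// Waiting only for "pending" cannot succeed once the cluster has moved past it.
	_, err := waitForAnyState(alreadyActive, []string{"pending"}, 10*time.Millisecond, 50*time.Millisecond)
	fmt.Println("single target:", err)

	// Accepting "active" as well lets the wait return on the first poll.
	state, err := waitForAnyState(alreadyActive, []string{"pending", "active"}, 10*time.Millisecond, 50*time.Millisecond)
	fmt.Println("multiple targets:", state, err)
}
```

With only "pending" as a target the loop runs until the timeout and reports "active" as the last state, which matches the shape of the error in the original report. With both targets it returns right away.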

Mar 29 '23 09:03 herrbpl

As a workaround for those who think they need to destroy their entire state to reimport: I was able to get away with just removing rancher2_cluster from the state via

terraform state rm rancher2_cluster.mycluster

and then import it via

terraform import rancher2_cluster.mycluster c-abcd

Thus I didn't need to destroy everything Terraform had managed to provision so far. It seemed to work.

Mar 29 '23 09:03 herrbpl

Good catch @herrbpl. I thought it would be something like that considering it sometimes works.

Mar 29 '23 10:03 tulanian

Hey,

I have been testing this locally and was not able to reproduce it after trying and applying it multiple times (maybe I was lucky).

These are the tests I ran, with versions:

Test 1:

Rancher version: 2.6.8
Rancher Terraform provider: 1.24.0
Terraform: 1.2.2
Kubernetes version: 1.22 (this was the oldest available version in EKS)
Cluster Type (Local/Downstream): Downstream Hosted EKS
Local Rancher cluster v1.24.4+k3s1

Screenshot 2023-05-04 at 11 55 56

Test 2:

Rancher version: 2.6.8
Rancher Terraform provider: 1.24.1
Terraform: 1.3.2
Kubernetes version: 1.24
Cluster Type (Local/Downstream): Downstream Hosted EKS
Local Rancher cluster v1.24.4+k3s1

Screenshot 2023-05-04 at 12 58 02

Also, I submitted a PR that tries to fix this issue: https://github.com/rancher/terraform-provider-rancher2/pull/1114

May 04 '23 14:05 furkatgofurov7

Just tried again twice with these versions. Both times the terraform apply timed out after 30 min, but the cluster was live in Rancher after about 60s.

Rancher version: 2.7.1
Rancher Terraform provider: 3.0.0
Terraform: 1.3.3
EKS downstream K8s: v1.23.17
Rancher K8s: v1.23.16

May 05 '23 09:05 tulanian

We also sometimes encountered the problem mentioned at the beginning: the expectedState in the Terraform provider (pending) and the status of the Rancher import (active) did not match.

Our previous workarounds were to avoid the "active" state by, for example, setting up the authorisation or the network connection at a later time. In the end, however, this only resulted in the status in Rancher being "waiting" and also did not match the "pending" expected by the Terraform provider.

In my opinion, "active" should definitely be included in the expectedStates. Whether "waiting" should be part of the expectedState is certainly a topic for discussion and depends on whether the status of a successful import or only the status of a successfully created import "resource" is to be checked here. The latter would also include "waiting", since as soon as all prerequisites have been met, the import continues and hopefully jumps to the "Active" state.
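The two readings of the check can be written down as a small Go sketch. This is purely illustrative: `targetStates` and its `acceptWaiting` flag are hypothetical and do not claim to reflect what the provider or the PR actually does.

```go
package main

import "fmt"

// targetStates returns the states a wait loop could accept as "done".
// Both parameters are hypothetical knobs for this sketch:
//   - imported: the cluster is an imported one (e.g. imported EKS)
//   - acceptWaiting: only the creation of the import resource is checked,
//     so a cluster still waiting on prerequisites also counts
func targetStates(imported, acceptWaiting bool) []string {
	if !imported {
		return []string{"active"}
	}
	states := []string{"pending", "active"}
	if acceptWaiting {
		states = append(states, "waiting")
	}
	return states
}

func main() {
	fmt.Println(targetStates(true, false)) // [pending active]
	fmt.Println(targetStates(true, true))  // [pending active waiting]
}
```

Leaving "waiting" out means the apply only returns once the import has either just started or fully succeeded. Including it treats the import resource itself as the thing being verified.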

Currently our solution is to use the implemented fix from PR https://github.com/rancher/terraform-provider-rancher2/pull/1114 and we can confirm that it works fine.

Versions used:

Rancher version: 2.7.4/2.6.10
Rancher Terraform provider: 3.0.0
Terraform: 1.4.5
EKS downstream K8s: v1.24.x/v1.23.x
Rancher K8s: v1.24.13/v1.23.17

Jun 06 '23 09:06 drasang

https://github.com/rancher/terraform-provider-rancher2/pull/1114 was merged, which should fix this issue.

Jul 11 '23 10:07 kkaempf

Will be tested via https://github.com/rancher/eks-operator/issues/84

Jul 11 '23 10:07 kkaempf

@kkaempf @furkatgofurov7 Sounds great, thank you guys. For QA, post the test steps here just to be clear on how to verify the intermittent import issue is resolved.

Jul 11 '23 14:07 a-blender

> For QA, post the test steps here just to be clear on how to verify the intermittent import issue is resolved.

@cpinjani - please link your testplan here once you start working on rancher/eks-operator#84

Jul 13 '23 15:07 kkaempf

> For QA, post the test steps here just to be clear on how to verify the intermittent import issue is resolved.
>
> @cpinjani - please link your testplan here once you start working on rancher/eks-operator#84

Test Results - https://github.com/rancher/eks-operator/issues/84#issuecomment-1636382046

Jul 14 '23 20:07 cpinjani

QA validated

Jul 18 '23 10:07 kkaempf
