terraform-provider-rancher2

RKE version is not supported on the first run, gets fixed on the second run

Open iTaybb opened this issue 4 years ago • 6 comments

I'm trying to deploy RKE v1.20.6-rancher1-1 with rancher v2.5.8, a combination that should be supported according to the release notes.

I'm getting the following error:

Error: RKE version is not supported [v1.20.5-rancher1-1 v1.19.10-rancher1-1 ................... ] got v1.20.6-rancher1-1

Weirdly enough, after re-running the terraform plan, it runs fine, so somehow the v1.20.6-rancher1-1 version is approved after some time.

Might be a race condition of some kind? Maybe rancher is not fully available yet?

iTaybb avatar May 24 '21 06:05 iTaybb

It would seem that when rancher is bootstrapped, it takes some time for the rancher RKE images to become ready, so if you're using terraform to install the rancher instance, bootstrap it, and then attempt to create a cluster, the RKE images might not be ready yet.

By running curl -sSku $TOKEN https://$RANCHER_IP/v3/rkek8ssystemimages | jq -c '.pagination.total' repeatedly right after bootstrapping, I can see the image count climb:

10:06:19  rancher2_bootstrap.admin (local-exec): 122
10:06:22  rancher2_bootstrap.admin (local-exec): 143
10:06:24  rancher2_bootstrap.admin (local-exec): 163
10:06:27  rancher2_bootstrap.admin (local-exec): 168
10:06:29  rancher2_bootstrap.admin (local-exec): 168
10:06:30  rancher2_bootstrap.admin (local-exec): 168

which shows that the images are still loading.

I suggest that rancher2_bootstrap should check that all the rkek8ssystemimages are loaded through the API.
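Since the error names the exact version that was rejected, a more direct readiness check is whether that version already appears in the list Rancher reports. A minimal sketch, assuming the versions arrive one per line on stdin; the function name version_listed is hypothetical, and the jq path in the usage comment is an assumption about the response shape that nothing in this thread confirms.

```shell
#!/bin/bash
# Hypothetical helper: succeed if the version given as $1 appears as a whole
# line in the newline-separated list read from stdin.
version_listed() {
    local wanted=$1
    # -F: fixed string, -x: match the whole line, -q: no output
    grep -Fxq -- "$wanted"
}

# Usage sketch (the .data[].name path is an assumption, not confirmed here):
#   curl -sSku "$TOKEN" "https://$RANCHER_IP/v3/rkek8ssystemimages" \
#       | jq -r '.data[].name' \
#       | version_listed "v1.20.6-rancher1-1" && echo "version available"
```

Matching whole lines avoids a false positive where a shorter string such as v1.20.5 would match v1.20.5-rancher1-1.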

As a workaround, you can probably run some hacky script like this:

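#!/bin/bash
# Poll the Rancher API until the RKE system image count is non-zero and
# unchanged across three consecutive polls.

LAST_LAST_COUNT=-1
LAST_COUNT=-1
while true; do
    COUNT=$(curl -sSku "$TOKEN" "https://$RANCHER_IP/v3/rkek8ssystemimages" | jq -c '.pagination.total')
    echo "$COUNT RKE images loaded."
    # Use -gt for a numeric comparison; > inside [[ ]] compares strings.
    [[ "$COUNT" =~ ^[0-9]+$ && "$COUNT" -gt 0 && "$COUNT" == "$LAST_COUNT" && "$COUNT" == "$LAST_LAST_COUNT" ]] && exit 0
    LAST_LAST_COUNT=$LAST_COUNT
    LAST_COUNT=$COUNT
    sleep 1
done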
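
A loop like this never gives up if Rancher stays unreachable or the count never settles. A minimal sketch of a bounded variant follows; the names wait_for_stable_count and fetch_count and the argument order are hypothetical, and "three identical non-zero readings" is the same heuristic as in the script, not a guarantee that loading has finished.

```shell
#!/bin/bash
# Hypothetical helper: poll a command that prints a count until the count is
# a positive integer and identical on three consecutive polls, or give up
# after MAX_TRIES polls.
# Usage: wait_for_stable_count MAX_TRIES INTERVAL_SECONDS COMMAND [ARGS...]
wait_for_stable_count() {
    local max_tries=$1 interval=$2
    shift 2
    local prev=-1 prev_prev=-1 count i
    for ((i = 0; i < max_tries; i++)); do
        # A failing command is treated as "no reading yet".
        count=$("$@" 2>/dev/null) || count=""
        if [[ "$count" =~ ^[0-9]+$ ]] && ((10#$count > 0)) &&
            [[ "$count" == "$prev" && "$count" == "$prev_prev" ]]; then
            echo "$count"
            return 0
        fi
        prev_prev=$prev
        prev=$count
        sleep "$interval"
    done
    return 1
}

# Assumed wrapper around the API call used earlier in this thread.
fetch_count() {
    curl -sSku "$TOKEN" "https://$RANCHER_IP/v3/rkek8ssystemimages" | jq -c '.pagination.total'
}
```

It could be called as wait_for_stable_count 300 1 fetch_count || exit 1, so that a Rancher instance that never becomes ready fails the run instead of hanging it.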

iTaybb avatar May 30 '21 07:05 iTaybb

@iTaybb, yes, it seems to be a race condition between the bootstrap finishing and the local cluster becoming active. A fix was added in PR #679: the rancher2_bootstrap resource will now wait until the local cluster is active.

rawmind0 avatar May 31 '21 14:05 rawmind0

PR https://github.com/rancher/terraform-provider-rancher2/pull/679 is already merged. The fix will be available in the next tf provider release.

Please reopen the issue if needed.

rawmind0 avatar Jun 21 '21 11:06 rawmind0

@rawmind0 Unfortunately this is still/again happening, see https://github.com/rancher/quickstart/issues/196. I can also reproduce this every 10th time or so.

bashofmann avatar Jan 07 '22 16:01 bashofmann

The issue is happening again in rancher 2.6.3 and terraform provider v1.22.2.

iTaybb avatar Jan 31 '22 10:01 iTaybb

This may or may not work for you, but my fix was to do the following:

# Initialize Rancher server
resource "rancher2_bootstrap" "admin" {
  depends_on = [
    helm_release.rancher_server
  ]

  provider = rancher2.bootstrap

  password  = var.admin_password
  telemetry = true
}

locals {
  rke_network_plugin = "canal"
  rke_network_options = null
}

Then, add this:

# Give Rancher time to finish loading after the bootstrap completes
resource "time_sleep" "wait_60_seconds" {
  depends_on      = [rancher2_bootstrap.admin]
  create_duration = "60s"
}

and in the resource declaration for the workload cluster, depend on that sleep:

# Create custom managed cluster for amf
resource "rancher2_cluster" "amf_workload" {
  depends_on = [time_sleep.wait_60_seconds]

  # ... rest of the cluster configuration, not shown in the original post ...
}

phillamb168 avatar Jun 15 '22 22:06 phillamb168