cluster-api-provider-ibmcloud

If transit gateway fails creation in PowerVS then fail CAPI deploy

Open hamzy opened this issue 1 year ago • 3 comments

/kind bug /area provider/ibmcloud

What steps did you take and what happened:

During an IPI CAPI cluster creation, the transit gateway failed to be created. The cluster is useless without it.

What did you expect to happen: Immediate failure.

Anything else you would like to add:

{"errors":[{"code":"precondition_failed","message":"cannot add more than 5 gateways to the selected region","more_info":"https://cloud.ibm.com/apidocs/transit-gateway#error-handling"}],"trace":"5261aa71-e822-4340-baef-8c35e6186852"}
E0308 06:25:17.662235 4128998 ibmpowervscluster_controller.go:183]  "msg"="failed to reconcile transit gateway" "error"="error creating transit gateway: cannot add more than 5 gateways to the selected region" "IBMPowerVSCluster"={"name":"rdr-hamzy-test-dal10-58hkl","namespace":"openshift-cluster-api-guests"} "cluster"={"name":"rdr-hamzy-test-dal10-58hkl","namespace":"openshift-cluster-api-guests"} "controller"="ibmpowervscluster" "controllerGroup"="infrastructure.cluster.x-k8s.io" "controllerKind"="IBMPowerVSCluster" "name"="rdr-hamzy-test-dal10-58hkl" "namespace"="openshift-cluster-api-guests" "reconcileID"="3665bd34-7bdb-4785-aae1-a0ed76a199fc"

Environment:

  • Cluster-api version:
  • Minikube/KIND version:
  • Kubernetes version: (use kubectl version):
  • OS (e.g. from /etc/os-release):

hamzy Mar 08 '24 12:03

@hamzy thanks for reporting the issue. Can you please share more information, such as a complete dump of the IBMPowerVSCluster resource?

@Karthik-K-N are we setting the right state on the cluster when an error happens? We need to discuss how to fail fast when things go wrong. At the very least we need a condition, or a design decision on how many times we really want to retry when a resource fails to be created.

mkumatag Mar 08 '24 12:03

[inner hamzy@li-3d08e84c-2e1c-11b2-a85c-e2db7bb078fc hamzy-installer]$ oc get ibmpowervscluster -n openshift-cluster-api-guests -o yaml
apiVersion: v1
items:
- apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
  kind: IBMPowerVSCluster
  metadata:
    annotations:
      powervs.cluster.x-k8s.io/create-infra: "true"
    creationTimestamp: "2024-03-08T12:24:43Z"
    finalizers:
    - ibmpowervscluster.infrastructure.cluster.x-k8s.io
    generation: 1
    labels:
      cluster.x-k8s.io/cluster-name: rdr-hamzy-test-dal10-58hkl
    name: rdr-hamzy-test-dal10-58hkl
    namespace: openshift-cluster-api-guests
    ownerReferences:
    - apiVersion: cluster.x-k8s.io/v1beta1
      blockOwnerDeletion: true
      controller: true
      kind: Cluster
      name: rdr-hamzy-test-dal10-58hkl
      uid: fd6c490c-c444-48e9-93b9-f573c82b1fb4
    resourceVersion: "436"
    uid: d72f51e2-3b1d-4db6-b89f-17d90525c623
  spec:
    controlPlaneEndpoint:
      host: ""
      port: 0
    cosInstance:
      bucketName: rhcos-powervs-images-us-south
      bucketRegion: us-south
      name: rdr-hamzy-test-dal10-58hkl-cos
    network:
      name: rdr-hamzy-test-dal10-58hkl-network
    resourceGroup:
      name: powervs-ipi-resource-group
    serviceInstance:
      id: 701beea6-d79d-4e8a-8e8a-8d122f3754b6
    serviceInstanceID: ""
    transitGateway:
      name: rdr-hamzy-test-dal10-58hkl-tg
    vpc:
      name: rdr-hamzy-test-dal10-58hkl-vpc
      region: us-south
    zone: dal10
  status:
    conditions:
    - lastTransitionTime: "2024-03-08T12:36:39Z"
      status: "True"
      type: NetworkReady
    - lastTransitionTime: "2024-03-08T12:24:45Z"
      status: "True"
      type: ServiceInstanceReady
    - lastTransitionTime: "2024-03-08T12:25:17Z"
      message: 'error creating transit gateway: cannot add more than 5 gateways to
        the selected region'
      reason: TransitGatewayReconciliationFailed
      severity: Error
      status: "False"
      type: TransitGatewayReady
    - lastTransitionTime: "2024-03-08T12:25:07Z"
      status: "True"
      type: VPCReady
    - lastTransitionTime: "2024-03-08T12:25:12Z"
      status: "True"
      type: VPCSubnetReady
    dhcpServer:
      controllerCreated: true
      id: 48a13744-959e-4c58-b3a1-0e3f5941a475
    network:
      controllerCreated: true
      id: 44e09ab9-b84c-4d70-8ac6-da0612f7e8d0
    ready: false
    resourceGroupID:
      controllerCreated: false
      id: c1cb9b2679344ee9951ab8b4bc22eca0
    vpc:
      controllerCreated: true
      id: r006-c5c1eb58-6685-48d3-a324-1885eafbcae9
    vpcSubnet:
      rdr-hamzy-test-dal10-58hkl-vpcsubnet-us-south-1:
        controllerCreated: true
        id: 0717-f8b6ae0b-d076-44c7-aa59-c60e20a7358b
      rdr-hamzy-test-dal10-58hkl-vpcsubnet-us-south-2:
        controllerCreated: true
        id: 0727-128430a8-69a6-4032-b95d-94ebf4603630
      rdr-hamzy-test-dal10-58hkl-vpcsubnet-us-south-3:
        controllerCreated: true
        id: 0737-ed2ea4cf-0958-4c72-82ee-f4994fb7526c
kind: List
metadata:
  resourceVersion: ""

hamzy Mar 08 '24 12:03

@hamzy as we can see, the TransitGatewayReady condition in the status is already set to False with severity Error, which shows that something is wrong with the infra, so the cluster never becomes active.
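For illustration, here is a minimal Go sketch of how a caller could pick out such failed conditions. The `Condition` struct is a local stand-in that mirrors only the fields visible in the dump above (it is not the Cluster API type), and `failedConditions` is a hypothetical helper name, not something the provider or the installer exposes.

```go
package main

import "fmt"

// Condition is a local stand-in holding only the fields that appear in the
// status dump above. It is NOT the Cluster API Condition type.
type Condition struct {
	Type     string
	Status   string
	Severity string
	Reason   string
	Message  string
}

// failedConditions (hypothetical helper) returns the conditions whose status
// is "False" and whose severity is "Error".
func failedConditions(conds []Condition) []Condition {
	var failed []Condition
	for _, c := range conds {
		if c.Status == "False" && c.Severity == "Error" {
			failed = append(failed, c)
		}
	}
	return failed
}

func main() {
	// Values copied from the IBMPowerVSCluster status in this issue.
	conds := []Condition{
		{Type: "NetworkReady", Status: "True"},
		{Type: "ServiceInstanceReady", Status: "True"},
		{
			Type:     "TransitGatewayReady",
			Status:   "False",
			Severity: "Error",
			Reason:   "TransitGatewayReconciliationFailed",
			Message:  "error creating transit gateway: cannot add more than 5 gateways to the selected region",
		},
		{Type: "VPCReady", Status: "True"},
		{Type: "VPCSubnetReady", Status: "True"},
	}
	for _, c := range failedConditions(conds) {
		fmt.Printf("%s failed (%s): %s\n", c.Type, c.Reason, c.Message)
	}
}
```

With the conditions from this issue, only TransitGatewayReady matches the filter; the other four conditions are "True" and are skipped.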

Controllers are designed to keep trying to make the resource available on the next retry, even after a failure. It is the user's conscious decision when to terminate the cluster based on the conditions, or to fix the environment in the backend so that the installation flow can proceed (e.g., the user asks an admin to raise the transit gateway limit in this case).

Maybe having a timeout in the installer, with some level of error checking of these conditions, would be a better way to deal with such situations.

mkumatag Mar 08 '24 13:03

As per the above comment, closing this issue.

mkumatag Aug 06 '24 05:08