
Add Analyzer for Bad vSphere Permissions

Open jonathanmeier5 opened this issue 3 years ago • 0 comments

What would you like to be added:

Add an analyzer that detects when cluster creation failed due to incorrect vSphere VM cloning permissions.

Why is this needed:

When provisioning a new management cluster, if your vSphere permissions are set incorrectly such that EKS-A cannot clone a VM for the control plane, it is difficult to determine the source of the problem.

The CLI hangs for 1hr+ and then fails with little explanation.

Improving the messaging around this failure mode would improve the product UX.

Right now, to see where the failure is, you need to look up the definition of the control plane's VSphereMachine object:

kubectl --kubeconfig mgmt-3/generated/mgmt-3.kind.kubeconfig get vspheremachines.infrastructure.cluster.x-k8s.io mgmt-3-control-plane-template-1659617090412-9qsck -n eksa-system -o yaml

The object definition will look as follows. Notice the two failed conditions (reason: CloningFailed) under status at the bottom.

apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: VSphereMachine
metadata:
  annotations:
    cluster.x-k8s.io/cloned-from-groupkind: VSphereMachineTemplate.infrastructure.cluster.x-k8s.io
    cluster.x-k8s.io/cloned-from-name: mgmt-3-control-plane-template-1659617090412
  creationTimestamp: "2022-08-04T12:44:53Z"
  finalizers:
  - vspheremachine.infrastructure.cluster.x-k8s.io
  generation: 1
  labels:
    cluster.x-k8s.io/cluster-name: mgmt-3
    cluster.x-k8s.io/control-plane: ""
  name: mgmt-3-control-plane-template-1659617090412-9qsck
  namespace: eksa-system
  ownerReferences:
  - apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: mgmt-3
    uid: 1dee2b12-e8d2-414c-aac0-3c179de40253
  - apiVersion: cluster.x-k8s.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: Machine
    name: mgmt-3-dgtdf
    uid: d9c12533-8d5d-4108-9d2b-78c5bae7ffce
  - apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: VSphereCluster
    name: mgmt-3
    uid: 330c5c94-a082-453d-aaa9-0488ccbc1b1a
  resourceVersion: "1597"
  uid: b83d3d37-2e42-4a31-a6f7-11392e504a90
spec:
  cloneMode: linkedClone
  datacenter: Datacenter
  datastore: /Datacenter/datastore/datastore1
  diskGiB: 25
  folder: /Datacenter/vm/jwmeier/permissiontest
  memoryMiB: 2048
  network:
    devices:
    - dhcp4: true
      networkName: /Datacenter/network/VM Network
  numCPUs: 2
  resourcePool: /Datacenter/host/Cluster-01/Resources/TestResourcePool
  server: ********
  template: /Datacenter/vm/Templates/bottlerocket-v1.22.10-kubernetes-1-22-eks-9-amd64-f18b278
status:
  conditions:
  - lastTransitionTime: "2022-08-04T12:44:57Z"
    message: 'error trigging clone op for machine infrastructure.cluster.x-k8s.io/v1beta1,
      Kind=VSphereVM eksa-system/mgmt-3-dgtdf: ServerFaultCode: Permission to perform
      this operation was denied.'
    reason: CloningFailed
    severity: Warning
    status: "False"
    type: Ready
  - lastTransitionTime: "2022-08-04T12:44:57Z"
    message: 'error trigging clone op for machine infrastructure.cluster.x-k8s.io/v1beta1,
      Kind=VSphereVM eksa-system/mgmt-3-dgtdf: ServerFaultCode: Permission to perform
      this operation was denied.'
    reason: CloningFailed
    severity: Warning
    status: "False"
    type: VMProvisioned
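
As a rough illustration of what such an analyzer could check, here is a minimal sketch in Python. It assumes the VSphereMachine object has been fetched as JSON (for example with `-o json`) and loaded into a dict. The function name `find_clone_permission_failures` and the matching rule are hypothetical and are not part of EKS-A; the sample data is trimmed from the object above.

```python
import json

# Substring of the vSphere fault message seen in the conditions above.
PERMISSION_DENIED = "Permission to perform this operation was denied"


def find_clone_permission_failures(machine: dict) -> list[str]:
    """Return the messages of conditions that indicate a denied clone operation.

    Hypothetical helper: a condition matches when its status is "False",
    its reason is "CloningFailed", and its message contains the vSphere
    permission-denied text.
    """
    conditions = machine.get("status", {}).get("conditions", [])
    return [
        c.get("message", "")
        for c in conditions
        if c.get("status") == "False"
        and c.get("reason") == "CloningFailed"
        and PERMISSION_DENIED in c.get("message", "")
    ]


# Trimmed copy of the VSphereMachine shown above, expressed as JSON.
sample = json.loads("""
{
  "kind": "VSphereMachine",
  "metadata": {"name": "mgmt-3-control-plane-template-1659617090412-9qsck"},
  "status": {
    "conditions": [
      {
        "type": "Ready",
        "status": "False",
        "reason": "CloningFailed",
        "severity": "Warning",
        "message": "error trigging clone op for machine infrastructure.cluster.x-k8s.io/v1beta1, Kind=VSphereVM eksa-system/mgmt-3-dgtdf: ServerFaultCode: Permission to perform this operation was denied."
      },
      {
        "type": "VMProvisioned",
        "status": "False",
        "reason": "CloningFailed",
        "severity": "Warning",
        "message": "error trigging clone op for machine infrastructure.cluster.x-k8s.io/v1beta1, Kind=VSphereVM eksa-system/mgmt-3-dgtdf: ServerFaultCode: Permission to perform this operation was denied."
      }
    ]
  }
}
""")

failures = find_clone_permission_failures(sample)
if failures:
    name = sample["metadata"]["name"]
    print(f"{name}: VM clone denied by vSphere; check the clone privileges of the configured user")
```

A real analyzer would also need to decide which machines to inspect and how to surface the result to the CLI user; this sketch only covers the condition match.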

This isn't an urgent need because we are implementing vSphere privilege validation and user configuration, but I still want to document the failure mode and note that an analyzer may be useful.

jonathanmeier5, Aug 23 '22 14:08