terraform-equinix-metal-anthos-on-vsphere

anthos 1.5.0-gke.27 errors in terraform apply

Open dfong opened this issue 4 years ago • 8 comments

i have been unable to get anthos 1.5.0-gke.27 to pass "terraform apply" without errors.

here are some of the error messages from the log.

        null_resource.anthos_deploy_cluster[0] (remote-exec): null_resource.anthos_deploy_cluster (remote-exec):     - [FATAL] Hosts for AntiAffinityGroups: Anti-affinity groups enabled with available
        null_resource.anthos_deploy_cluster[0] (remote-exec): null_resource.anthos_deploy_cluster (remote-exec): Some validation results were FATAL. Check report above.
        null_resource.anthos_deploy_cluster[0] (remote-exec): null_resource.anthos_deploy_cluster (remote-exec): Failed to create root cluster: unable to create node Machine Deployments: creating or updating machine deployment "gke-admin-node" in namespace "default": timed out waiting for the condition
        null_resource.anthos_deploy_cluster[0] (remote-exec): null_resource.anthos_deploy_cluster (remote-exec): error: stat /home/ubuntu/cluster/kpresubmit-500-kubeconfig: no such file or directory
        null_resource.anthos_deploy_cluster[0] (remote-exec): Error: error executing "/tmp/terraform_522095460.sh": Process exited with status 1

dfong avatar Oct 20 '20 19:10 dfong

I too have the same problem with 1.5.0

From what I can see, the admin workstation logs show:

I1021 06:53:15.927719    3118 spinner.go:125] Creating node Machines in internal cluster
I1021 06:53:15.932395    3118 clusterclient.go:886] Waiting for machine deployment "default/gke-admin-node" to to be ready, with retry interval "30s" and timeout "45m0s"
I1021 06:53:15.933959    3118 clusterclient.go:891] Machine deployment "default/gke-admin-node" is not ready: it hasn't yet been seen by controller (observed generation 0 < generation 1)
I1021 06:53:45.935671    3118 clusterclient.go:891] Machine deployment "default/gke-admin-node" is not ready: only 0/2 replicas are ready
I1021 06:54:15.935808    3118 clusterclient.go:891] Machine deployment "default/gke-admin-node" is not ready: only 0/2 replicas are ready
I1021 06:54:45.935620    3118 clusterclient.go:891] Machine deployment "default/gke-admin-node" is not ready: only 0/2 replicas are ready
I1021 06:55:15.935902    3118 clusterclient.go:891] Machine deployment "default/gke-admin-node" is not ready: only 1/2 replicas are ready
I1021 06:55:45.935870    3118 clusterclient.go:891] Machine deployment "default/gke-admin-node" is not ready: only 1/2 replicas are ready

The second machine doesn't become ready.
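For anyone comparing runs, the readiness lines in a log like the one above can be pulled out with a small script. This is only a sketch: the regular expression is an assumption based on the line format shown in this thread, and `replica_progress` and `SAMPLE` are hypothetical names, not part of gkectl or this repo.

```python
import re

# Assumed line format, taken from the log excerpt in this thread:
#   I1021 06:53:45.935671 3118 clusterclient.go:891] Machine deployment "..." is not ready: only 0/2 replicas are ready
READY_RE = re.compile(
    r'^I\d{4} (?P<time>\d{2}:\d{2}:\d{2})\.\d+ .*'
    r'Machine deployment "(?P<name>[^"]+)" is not ready: '
    r'only (?P<ready>\d+)/(?P<want>\d+) replicas are ready'
)


def replica_progress(lines):
    """Return (time, deployment, ready, wanted) for each matching log line."""
    progress = []
    for line in lines:
        m = READY_RE.search(line)
        if m:
            progress.append(
                (m["time"], m["name"], int(m["ready"]), int(m["want"]))
            )
    return progress


# A few lines copied from the log above; the first one does not match the pattern.
SAMPLE = [
    'I1021 06:53:15.933959    3118 clusterclient.go:891] Machine deployment "default/gke-admin-node" is not ready: it hasn\'t yet been seen by controller (observed generation 0 < generation 1)',
    'I1021 06:53:45.935671    3118 clusterclient.go:891] Machine deployment "default/gke-admin-node" is not ready: only 0/2 replicas are ready',
    'I1021 06:55:15.935902    3118 clusterclient.go:891] Machine deployment "default/gke-admin-node" is not ready: only 1/2 replicas are ready',
    'I1021 06:55:45.935870    3118 clusterclient.go:891] Machine deployment "default/gke-admin-node" is not ready: only 1/2 replicas are ready',
]

for entry in replica_progress(SAMPLE):
    print(entry)
```

If the ready count stops increasing well before the 45m timeout, as it does here at 1/2, that points at a single machine that never finishes provisioning.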

From a describe on the machine objects on the admin cluster I see 1 of the 3 objects not ready:-

ubuntu@admin-workstation:~/cluster$ kubectl --kubeconfig kubeconfig get machine
NAME
gke-admin-master-kn477
gke-admin-node-87d6b48b6-5jcvr
gke-admin-node-87d6b48b6-j294z

Describing the machine that is not ready (gke-admin-node-87d6b48b6-5jcvr) gives:

API Version:  cluster.k8s.io/v1alpha1
Kind:         Machine
Metadata:
  Creation Timestamp:  2020-10-21T06:53:20Z
  Finalizers:
    machine.cluster.k8s.io
  Generate Name:  gke-admin-node-87d6b48b6-
  Generation:     1
  Owner References:
    API Version:           cluster.k8s.io/v1alpha1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  MachineSet
    Name:                  gke-admin-node-87d6b48b6
    UID:                   34d6e535-248d-4a1c-b7f5-d805e5007098
  Resource Version:        104304
  Self Link:               /apis/cluster.k8s.io/v1alpha1/namespaces/default/machines/gke-admin-node-87d6b48b6-5jcvr
  UID:                     a0cbae48-7233-402c-9e19-1b2a9b753658
Spec:
  Anti Affinity Group:  .gke-admin-node-87d6b48b6-4hzxxp
  Metadata:
    Creation Timestamp:  <nil>
  Provider Spec:
    Value:
      API Version:  vsphereproviderconfig.k8s.io/v1alpha1
      Kind:         VsphereMachineProviderConfig
      Machine Variables:
        Datacenter:     Packet
        Datastore:      datastore1
        disk_label:     disk0
        disk_size:      40
        Folder:
        Memory:         16384
        Network:        VM Private Net
        num_cpus:       4
        resource_pool:  Packet-1/Resources/Anthos
        vm_template:    gke-on-prem-ubuntu-1.5.0-gke.27
      Metadata:
        Creation Timestamp:  <nil>
      Network Spec:
        Address:        <nil>
        Dns:            <nil>
        Ntp:
        Use IPAM:       false
      Vsphere Machine:  standard-node
  Versions:
    Kubelet:  1.17.9-gke.4400
Status:
  Failure Domain:  host-10
  Last Updated:    2020-10-21T08:27:29Z
  Phase:           Creating
  Provider Status:
  State:  Unavailable
Events:
  Type    Reason               Age                     From                Message
  ----    ------               ----                    ----                -------
  Normal  Powering on machine  4m7s (x13394 over 93m)  vsphere-controller  Powering on machine gke-admin-node-87d6b48b6-5jcvr

Nodes are:-

ubuntu@admin-workstation:~/cluster$ kubectl --kubeconfig kubeconfig get nodes
NAME                             STATUS   ROLES    AGE    VERSION
gke-admin-master-kn477           Ready    master   108m   v1.17.9-gke.4400
gke-admin-node-87d6b48b6-j294z   Ready    <none>   106m   v1.17.9-gke.4400
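Comparing the two listings by hand works for three machines, but the same check can be scripted. The sketch below assumes that a Machine and the Node it becomes share the same name, which is what the output in this thread suggests; `machines_without_nodes` is a hypothetical helper, and the name lists are copied from the kubectl output above rather than fetched from a cluster.

```python
def machines_without_nodes(machine_names, node_names):
    """Return machines that have no node of the same name, in input order."""
    nodes = set(node_names)
    return [name for name in machine_names if name not in nodes]


# Names copied from "kubectl get machine" and "kubectl get nodes" in this thread.
MACHINES = [
    "gke-admin-master-kn477",
    "gke-admin-node-87d6b48b6-5jcvr",
    "gke-admin-node-87d6b48b6-j294z",
]
NODES = [
    "gke-admin-master-kn477",
    "gke-admin-node-87d6b48b6-j294z",
]

print(machines_without_nodes(MACHINES, NODES))
# prints ['gke-admin-node-87d6b48b6-5jcvr']
```

The one machine left over is the same one stuck in the "Creating" phase with repeated "Powering on machine" events.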

PsychoSid avatar Oct 21 '20 08:10 PsychoSid

We may need to open a new issue to update the installation scripts to follow the new gkectl based instructions offered in https://cloud.google.com/anthos/gke/docs/on-prem/1.5/how-to/install-landing

displague avatar Nov 06 '20 17:11 displague

@displague, are you able to get "terraform apply" to work with anthos 1.5.0 ? has anyone gotten it to work?

dfong avatar Nov 08 '20 00:11 dfong

I didn't get it to work with 1.5.0, but 1.5.1 works OK most of the time. Sometimes it gets stuck, but as I tend to apply/destroy daily I catch transient issues more than most.

gfthybridlabs avatar Nov 09 '20 06:11 gfthybridlabs

@gfthybridlabs, thanks for the tip! i will give 1.5.1-gke.8 a try.

dfong avatar Nov 09 '20 18:11 dfong

I encountered similar issues with Anthos GKE on-prem versions 1.4.3-gke.3, 1.5.2-gke.3, and 1.5.1-gke.8. With all of these versions, the machine nodes always failed to create, so I was unable to bring them up.

Seeking help here.

parkitibabu avatar Nov 24 '20 07:11 parkitibabu

@parkitibabu , thanks for sharing your data. did you make any subsequent progress?

dfong avatar Jan 20 '21 19:01 dfong

i am giving up on anthos 1.5.0-gke.27, which i never got to work.

however, i did get anthos 1.5.1-gke.8 to work, with caveats:

  • only works with a 4-node configuration, not 2 nodes as had worked with earlier versions. the 2-node configuration errored out with messages about needing to set antiAffinityGroup to "enabled: false", but as near as i could tell, the yaml files already specified false.

  • i also locally made the change in my pending PR 112 - https://github.com/packet-labs/google-anthos/pull/112 .

this was with the current latest rev of the repo, b569d4c.

dfong avatar Jan 21 '21 03:01 dfong