terraform-equinix-metal-anthos-on-vsphere
anthos 1.5.0-gke.27 errors in terraform apply
i have been unable to get anthos 1.5.0-gke.27 to pass "terraform apply" without errors.
here are some of the error messages from the log.
null_resource.anthos_deploy_cluster[0] (remote-exec): null_resource.anthos_deploy_cluster (remote-exec): - [FATAL] Hosts for AntiAffinityGroups: Anti-affinity groups enabled with available
null_resource.anthos_deploy_cluster[0] (remote-exec): null_resource.anthos_deploy_cluster (remote-exec): Some validation results were FATAL. Check report above.
null_resource.anthos_deploy_cluster[0] (remote-exec): null_resource.anthos_deploy_cluster (remote-exec): Failed to create root cluster: unable to create node Machine Deployments: creating or updating machine deployment "gke-admin-node" in namespace "default": timed out waiting for the condition
null_resource.anthos_deploy_cluster[0] (remote-exec): null_resource.anthos_deploy_cluster (remote-exec): error: stat /home/ubuntu/cluster/kpresubmit-500-kubeconfig: no such file or directory
null_resource.anthos_deploy_cluster[0] (remote-exec): Error: error executing "/tmp/terraform_522095460.sh": Process exited with status 1
I too have the same problem with 1.5.0
From what I can see in the admin workstation logs, they show:
I1021 06:53:15.927719 3118 spinner.go:125] Creating node Machines in internal cluster
I1021 06:53:15.932395 3118 clusterclient.go:886] Waiting for machine deployment "default/gke-admin-node" to to be ready, with retry interval "30s" and timeout "45m0s"
I1021 06:53:15.933959 3118 clusterclient.go:891] Machine deployment "default/gke-admin-node" is not ready: it hasn't yet been seen by controller (observed generation 0 < generation 1)
I1021 06:53:45.935671 3118 clusterclient.go:891] Machine deployment "default/gke-admin-node" is not ready: only 0/2 replicas are ready
I1021 06:54:15.935808 3118 clusterclient.go:891] Machine deployment "default/gke-admin-node" is not ready: only 0/2 replicas are ready
I1021 06:54:45.935620 3118 clusterclient.go:891] Machine deployment "default/gke-admin-node" is not ready: only 0/2 replicas are ready
I1021 06:55:15.935902 3118 clusterclient.go:891] Machine deployment "default/gke-admin-node" is not ready: only 1/2 replicas are ready
I1021 06:55:45.935870 3118 clusterclient.go:891] Machine deployment "default/gke-admin-node" is not ready: only 1/2 replicas are ready
The second machine doesn't become ready.
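To make the stall easier to spot in a long log, the readiness lines can be reduced to just the points where the ready count changes. The following is a minimal sketch, assuming the log format matches the lines quoted above; the function name `readiness_changes` is made up for illustration and is not part of gkectl.

```python
import re

# Sample lines copied from the admin workstation log quoted above.
LOG = """\
I1021 06:53:45.935671 3118 clusterclient.go:891] Machine deployment "default/gke-admin-node" is not ready: only 0/2 replicas are ready
I1021 06:55:15.935902 3118 clusterclient.go:891] Machine deployment "default/gke-admin-node" is not ready: only 1/2 replicas are ready
I1021 06:55:45.935870 3118 clusterclient.go:891] Machine deployment "default/gke-admin-node" is not ready: only 1/2 replicas are ready
"""

# Assumed line shape: "<Immdd> <hh:mm:ss.micros> <pid> <file:line>] <message>"
PATTERN = re.compile(
    r'^\S+ (?P<time>\d{2}:\d{2}:\d{2})\.\d+ .*'
    r'Machine deployment "(?P<name>[^"]+)" is not ready: '
    r'only (?P<ready>\d+)/(?P<want>\d+) replicas are ready'
)


def readiness_changes(log_text):
    """Return (time, deployment, ready, wanted) each time the ready count changes."""
    changes = []
    last_ready = {}  # deployment name -> last ready count seen
    for line in log_text.splitlines():
        m = PATTERN.match(line)
        if not m:
            continue  # not a replica-readiness line
        name = m["name"]
        ready, want = int(m["ready"]), int(m["want"])
        if last_ready.get(name) != ready:
            changes.append((m["time"], name, ready, want))
            last_ready[name] = ready
    return changes


changes = readiness_changes(LOG)
for time, name, ready, want in changes:
    print(f"{time} {name}: {ready}/{want} ready")
```

If the last entry never reaches the wanted count before the 45m timeout, the deployment stalled at that replica count, which is what the log above shows at 1/2.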
Looking at the machine objects on the admin cluster, I see 1 of the 3 is not ready:
ubuntu@admin-workstation:~/cluster$ kubectl --kubeconfig kubeconfig get machine
NAME
gke-admin-master-kn477
gke-admin-node-87d6b48b6-5jcvr
gke-admin-node-87d6b48b6-j294z
The describe output for the machine that is not ready (gke-admin-node-87d6b48b6-5jcvr):
API Version: cluster.k8s.io/v1alpha1
Kind: Machine
Metadata:
Creation Timestamp: 2020-10-21T06:53:20Z
Finalizers:
machine.cluster.k8s.io
Generate Name: gke-admin-node-87d6b48b6-
Generation: 1
Owner References:
API Version: cluster.k8s.io/v1alpha1
Block Owner Deletion: true
Controller: true
Kind: MachineSet
Name: gke-admin-node-87d6b48b6
UID: 34d6e535-248d-4a1c-b7f5-d805e5007098
Resource Version: 104304
Self Link: /apis/cluster.k8s.io/v1alpha1/namespaces/default/machines/gke-admin-node-87d6b48b6-5jcvr
UID: a0cbae48-7233-402c-9e19-1b2a9b753658
Spec:
Anti Affinity Group: .gke-admin-node-87d6b48b6-4hzxxp
Metadata:
Creation Timestamp: <nil>
Provider Spec:
Value:
API Version: vsphereproviderconfig.k8s.io/v1alpha1
Kind: VsphereMachineProviderConfig
Machine Variables:
Datacenter: Packet
Datastore: datastore1
disk_label: disk0
disk_size: 40
Folder:
Memory: 16384
Network: VM Private Net
num_cpus: 4
resource_pool: Packet-1/Resources/Anthos
vm_template: gke-on-prem-ubuntu-1.5.0-gke.27
Metadata:
Creation Timestamp: <nil>
Network Spec:
Address: <nil>
Dns: <nil>
Ntp:
Use IPAM: false
Vsphere Machine: standard-node
Versions:
Kubelet: 1.17.9-gke.4400
Status:
Failure Domain: host-10
Last Updated: 2020-10-21T08:27:29Z
Phase: Creating
Provider Status:
State: Unavailable
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Powering on machine 4m7s (x13394 over 93m) vsphere-controller Powering on machine gke-admin-node-87d6b48b6-5jcvr
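The event counter in that last line gives a rough idea of how hard the controller is retrying. A small sketch of the arithmetic, assuming the age field keeps the `(xN over <duration>)` shape shown above and that the duration only uses h, m and s units (the helper name is hypothetical):

```python
import re

# Age column copied from the Events table above.
EVENT_AGE = "4m7s (x13394 over 93m)"


def events_per_second(age_field):
    """Average event rate from a kubectl-style '(xN over 1h2m3s)' age field."""
    m = re.search(r"\(x(\d+) over (?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?\)", age_field)
    if not m:
        raise ValueError(f"no repeat count found in {age_field!r}")
    count = int(m.group(1))
    # Missing units (for example no hours) are treated as zero.
    hours, minutes, seconds = (int(g) if g else 0 for g in m.groups()[1:])
    window = hours * 3600 + minutes * 60 + seconds
    if window == 0:
        raise ValueError("event window is zero seconds")
    return count / window


rate = events_per_second(EVENT_AGE)
print(f"{rate:.2f} 'Powering on machine' events per second")
```

13394 events over 93 minutes is a bit more than two per second, which suggests the vsphere-controller is looping on the power-on step rather than waiting on a slow boot. That reading is an inference from the counter, not something the log states.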
The nodes are:
ubuntu@admin-workstation:~/cluster$ kubectl --kubeconfig kubeconfig get nodes
NAME STATUS ROLES AGE VERSION
gke-admin-master-kn477 Ready master 108m v1.17.9-gke.4400
gke-admin-node-87d6b48b6-j294z Ready <none> 106m v1.17.9-gke.4400
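Comparing the two listings shows which machine never registered as a node. A minimal sketch of that comparison, assuming (as in the output above) that a node carries the same name as its machine object:

```python
# Output of "kubectl get machine" and "kubectl get nodes" as pasted above.
MACHINES = """\
NAME
gke-admin-master-kn477
gke-admin-node-87d6b48b6-5jcvr
gke-admin-node-87d6b48b6-j294z
"""

NODES = """\
NAME STATUS ROLES AGE VERSION
gke-admin-master-kn477 Ready master 108m v1.17.9-gke.4400
gke-admin-node-87d6b48b6-j294z Ready <none> 106m v1.17.9-gke.4400
"""


def names(table):
    """Return the first column of every row after the header line."""
    rows = [line.split() for line in table.splitlines()[1:]]
    return {cols[0] for cols in rows if cols}


# Machines that exist as objects but have no matching node.
missing = sorted(names(MACHINES) - names(NODES))
print(missing)
```

The one name left over is the machine stuck in the Creating phase in the describe output.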
We may need to open a new issue to update the installation scripts to follow the new gkectl-based instructions offered in https://cloud.google.com/anthos/gke/docs/on-prem/1.5/how-to/install-landing
@displague, are you able to get "terraform apply" to work with anthos 1.5.0? has anyone gotten it to work?
I didn't get it to work with 1.5.0. That said, 1.5.1 works OK most of the time. Sometimes it gets stuck, but as I tend to apply/destroy daily I catch transient issues more than most.
@gfthybridlabs, thanks for the tip! i will give 1.5.1-gke.8 a try.
I encountered similar issues with Anthos GKE on-prem versions 1.4.3-gke.3, 1.5.2-gke.3, and 1.5.1-gke.8. With all of these versions, the machine nodes always failed to create, so I was unable to bring them up.
Seeking help here.
@parkitibabu, thanks for sharing your data. did you make any subsequent progress?
i am giving up on anthos 1.5.0-gke.27, which i never got to work.
however, i did get anthos 1.5.1-gke.8 to work, with caveats:
- only works with a 4-node configuration, not 2 nodes as had worked with earlier versions. the 2-node configuration errored out with messages about needing to set antiAffinityGroup to "enabled: false", but as near as i could tell, the yaml files already specified false.
- i also locally made the change in my pending PR 112 (https://github.com/packet-labs/google-anthos/pull/112). this was with the current latest rev of the repo, b569d4c.
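For anyone double-checking the anti-affinity setting across several generated yaml files, a plain text scan can report every `enabled:` value nested under an anti-affinity key. This is only a sketch: the sample fragment below is hypothetical, the real config layout may differ between versions, and inline forms such as `{enabled: false}` are not handled.

```python
import re

# Hypothetical config fragment for illustration; not copied from the repo.
SAMPLE = """\
vcenter:
  datacenter: Packet
antiAffinityGroups:
  enabled: false
"""


def anti_affinity_settings(text):
    """Return each `enabled:` value found under an antiAffinityGroup(s) key."""
    found = []
    block_indent = None  # indent of the anti-affinity key while inside its block
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        indent = len(line) - len(line.lstrip())
        if block_indent is not None and indent <= block_indent:
            block_indent = None  # dedent means the block has ended
        if re.match(r"antiaffinitygroups?:", stripped, re.IGNORECASE):
            block_indent = indent
            continue
        if block_indent is not None:
            m = re.match(r"enabled:\s*(\S+)", stripped)
            if m:
                found.append(m.group(1).lower() == "true")
    return found


print(anti_affinity_settings(SAMPLE))
```

Running it over each file passed to gkectl would show whether any copy still has the setting enabled, or whether a file lacks the key entirely (an empty list).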