kubernetes-ovn-heterogeneous-cluster

Error when running install_ovn.ps1 (windows-init.exe)

Open greigs opened this issue 8 years ago • 11 comments

I'm getting the following error when running the install_ovn.ps1 script on the windows host. I'm fairly sure my settings are correct. Does K8S_CLUSTER_ROUTER need to be defined somewhere?

Traceback (most recent call last):
  File "windows-init.py", line 151, in <module>
  File "windows-init.py", line 147, in minion_init
  File "windows-init.py", line 43, in create_management_port
  File "windows-init.py", line 26, in get_k8s_cluster_router
Exception: K8S_CLUSTER_ROUTER not found

greigs avatar Mar 16 '17 16:03 greigs

@aserdean any changes that have not been properly propagated to the zip file hosted by Cloudbase?

pires avatar Mar 16 '17 18:03 pires

Could this error be a symptom of a connection issue? It doesn't seem like it, though: the error is returned immediately rather than after a timeout.

My setup is as such:

[Images: VM setup, Network Setup]

the machines can ping each other

on the master node (set up as the example):

export HOSTNAME=`hostname`
export K8S_VERSION=1.5.3
export K8S_POD_SUBNET=10.244.0.0/16
export K8S_NODE_POD_SUBNET=10.244.2.0/24
export K8S_DNS_SERVICE_IP=10.100.0.10
export K8S_DNS_DOMAIN=cluster.local

on the windows worker node:

$SUBNET="10.244.2.0/24" # The minion subnet used to spawn pods on
$GATEWAY_IP="10.244.2.1" # first ip of the subnet
$CLUSTER_IP_SUBNET="10.244.0.0/16" # The big subnet which includes the minions subnets
$INTERFACE_ALIAS="Ethernet" # Interface used for creating the overlay tunnels (must have connectivity with other hosts)
$KUBERNETES_API_SERVER="10.142.0.2" # API kubernetes server IP
$PUBLIC_IP="10.142.0.3" # IP of $INTERFACE_ALIAS (must be able to reach other hosts)
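The comments on the settings above imply two checks: the minion subnet should sit inside the cluster subnet, and the gateway should be the first IP of the minion subnet. The following is a minimal sketch of those checks using Python's standard ipaddress module. The function name check_node_network is hypothetical and is not part of install_ovn.ps1 or windows-init.py.

```python
import ipaddress


def check_node_network(subnet, gateway_ip, cluster_subnet):
    """Return a list of human-readable problems; an empty list means the values agree.

    Hypothetical helper: it only mirrors the comments in the settings
    ("first ip of the subnet", "the big subnet which includes the minions subnets").
    """
    node_net = ipaddress.ip_network(subnet)
    cluster_net = ipaddress.ip_network(cluster_subnet)
    gateway = ipaddress.ip_address(gateway_ip)

    problems = []
    # The minion subnet must be contained in the cluster-wide pod subnet.
    if not node_net.subnet_of(cluster_net):
        problems.append(f"{node_net} is not inside {cluster_net}")
    # "First ip of the subnet" is taken here as network address + 1.
    expected_gateway = node_net.network_address + 1
    if gateway != expected_gateway:
        problems.append(f"gateway {gateway} is not {expected_gateway}")
    return problems


# Values from this setup.
print(check_node_network("10.244.2.0/24", "10.244.2.1", "10.244.0.0/16"))
```

With the values posted here both checks pass, which is consistent with the error not being caused by these settings.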

greigs avatar Mar 17 '17 10:03 greigs

My guess is that you don't have a gateway node, and maybe some last-minute changes before the GCP Next 2017 demo made it a requirement. @alinbalutoiu and @aserdean should know how to identify the root cause better than I do at this point.

pires avatar Mar 17 '17 11:03 pires

The gateway node is listed as the last thing to set up in the doc. Does it possibly just need to be created before the worker nodes?

greigs avatar Mar 17 '17 11:03 greigs

All binaries are unchanged at the moment.

We have another one in the works which allows further logging, but we did not properly test that one yet.

It halts the execution in: https://github.com/alinbalutoiu/ovn_alpha/blob/d60a50e440d9d17da320a8acba1766e3cff31b86/bin/ovn-k8s-overlay#L70-L77.

It either cannot find: https://github.com/alinbalutoiu/ovn_alpha/blob/d60a50e440d9d17da320a8acba1766e3cff31b86/bin/ovn-k8s-overlay#L271 or you are trying to do a gw init on the windows node.

I guess you ran the init scripts a couple of times, which might have messed up the config. Can you please show us the output of ovn-nbctl show and ovn-sbctl show on the master node?

aserdean avatar Mar 17 '17 12:03 aserdean

I've only run the init scripts once on the master node. On the windows node I have run the script multiple times, but I have also deleted and recreated this node multiple times in a trial-and-error fashion, trying different network values.

Output:

root@sig-windows-master:~# ovn-nbctl show
root@sig-windows-master:~# ovn-sbctl show
Chassis "3dfb3758-14a6-4b89-ab55-48d1eb391f84"
    hostname: "sig-windows-worker-windows-1"
    Encap geneve
        ip: "10.142.0.3"
        options: {csum="true"}
Chassis "919d7288-6885-4102-801b-db98cb3fcaf2"
    hostname: "sig-windows-worker-windows-1"
    Encap geneve
        ip: "10.142.0.3"
        options: {csum="true"}
Chassis "90f4881c-8740-4aa3-88fd-f140f67535de"
    hostname: "sig-windows-worker-windows-1"
    Encap geneve
        ip: "10.142.0.3"
        options: {csum="true"}
Chassis "63914194-9611-425d-b316-454463d3c6fd"
    hostname: "sig-windows-worker-windows-1"
    Encap geneve
        ip: "10.142.0.3"
        options: {csum="true"}
Chassis "49f459b9-a35e-44d9-aef8-82953b187e8b"
    hostname: "sig-windows-worker-windows-1"
    Encap geneve
        ip: "10.142.0.3"
        options: {csum="true"}
Chassis "049316d1-f697-4cfd-8753-6d6701b2a34e"
    hostname: "sig-windows-worker-windows-1"
    Encap geneve
        ip: "10.142.0.3"
        options: {csum="true"}
Chassis "e6e9f92c-5141-4b0c-b1df-a018d55a6aaa"
    hostname: "sig-windows-master.xxxxxxxxxxxx.internal"
    Encap geneve
        ip: "10.142.0.2"
        options: {csum="true"}
Chassis "240a4793-f40d-43eb-af6c-1bc6d0993cb3"
    hostname: "sig-windows-worker-windows-1"
    Encap geneve
        ip: "10.142.0.3"
        options: {csum="true"}
Chassis "4aaf794b-c3ed-45d3-8196-32d488f570f0"
    hostname: "sig-windows-worker-windows-1"
    Encap geneve
        ip: "10.142.0.3"
        options: {csum="true"}
Chassis "2f45c22d-8dbd-4e0a-9a53-e4ec0da9855e"
    hostname: "sig-windows-worker-windows-1"
    Encap geneve
        ip: "10.142.0.3"
        options: {csum="true"}

greigs avatar Mar 17 '17 12:03 greigs

Thanks for the output :). root@sig-windows-master:~# ovn-nbctl show should have shown you something. Is the service running?

Can you post the logs from ovn-northd by any chance?

Also you can join #sig-windows if you wish, so we can go step by step when you recreate the env.

aserdean avatar Mar 17 '17 12:03 aserdean

root@sig-windows-master:~# kubectl get nodes
NAME                 STATUS                     AGE
sig-windows-master   Ready,SchedulingDisabled   22h
root@sig-windows-master:~# kubectl -n kube-system get pods
NAME                                         READY     STATUS    RESTARTS   AGE
kube-apiserver-sig-windows-master            1/1       Running   0          22h
kube-controller-manager-sig-windows-master   1/1       Running   0          22h
kube-dns-1216797708-9vl76                    0/3       Pending   0          22h
kube-scheduler-sig-windows-master            1/1       Running   0          22h
root@sig-windows-master:~# ovn-northd
2017-03-17T12:52:43Z|00001|reconnect|INFO|unix:/var/run/openvswitch/ovnnb_db.sock: connecting...
2017-03-17T12:52:43Z|00002|reconnect|INFO|unix:/var/run/openvswitch/ovnsb_db.sock: connecting...
2017-03-17T12:52:43Z|00003|reconnect|INFO|unix:/var/run/openvswitch/ovnnb_db.sock: connected
2017-03-17T12:52:43Z|00004|reconnect|INFO|unix:/var/run/openvswitch/ovnsb_db.sock: connected

I've included my full bash history: bash.txt

greigs avatar Mar 17 '17 12:03 greigs

Hello @greigs! I think you missed the master-init step on the master node; I do not see it in your bash history. At the end of the tutorial (https://github.com/apprenda/kubernetes-ovn-heterogeneous-cluster/blob/master/master/README.md) there is a part saying "After making sure the API server is up & running, you need to configure pod networking for this node". Could you please try to execute that part too? Make sure you clean up the ovn-sbctl db first by running ovn-sbctl chassis-del <chassis_id>. As an example, for this chassis:

Chassis "2f45c22d-8dbd-4e0a-9a53-e4ec0da9855e"
    hostname: "sig-windows-worker-windows-1"
    Encap geneve
        ip: "10.142.0.3"
        options: {csum="true"}

You should run ovn-sbctl chassis-del 2f45c22d-8dbd-4e0a-9a53-e4ec0da9855e. Repeat this for every chassis, then go and execute the last part of the tutorial.
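Repeating chassis-del by hand for ten entries is tedious, so the cleanup can be sketched as a small Python script. It assumes ovn-sbctl is on the PATH of the master node and that the show output has the shape posted earlier in this thread. The helper names parse_chassis and delete_all_chassis are hypothetical. Deletion defaults to a dry run that only prints the commands.

```python
import re
import subprocess

# Matches lines such as: Chassis "2f45c22d-8dbd-4e0a-9a53-e4ec0da9855e"
CHASSIS_RE = re.compile(r'^Chassis "?([^"\s]+)"?\s*$')
# Matches lines such as:     hostname: "sig-windows-worker-windows-1"
HOSTNAME_RE = re.compile(r'^\s+hostname: "?([^"\s]+)"?\s*$')


def parse_chassis(show_output):
    """Return (chassis_id, hostname) pairs found in `ovn-sbctl show` text."""
    chassis = []
    for line in show_output.splitlines():
        match = CHASSIS_RE.match(line)
        if match:
            chassis.append([match.group(1), None])
            continue
        match = HOSTNAME_RE.match(line)
        if match and chassis:
            chassis[-1][1] = match.group(1)
    return [tuple(entry) for entry in chassis]


def delete_all_chassis(dry_run=True):
    """Run `ovn-sbctl chassis-del <id>` for every chassis (prints only when dry_run)."""
    output = subprocess.run(
        ["ovn-sbctl", "show"], check=True, capture_output=True, text=True
    ).stdout
    for chassis_id, hostname in parse_chassis(output):
        command = ["ovn-sbctl", "chassis-del", chassis_id]
        if dry_run:
            print(" ".join(command), "# hostname:", hostname)
        else:
            subprocess.run(command, check=True)
```

Calling delete_all_chassis() prints the commands it would run; delete_all_chassis(dry_run=False) executes them. Note that this removes every chassis, including the master's own entry, which matches the "repeat this for every chassis" advice above.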

alinbalutoiu avatar Mar 17 '17 13:03 alinbalutoiu

Ah! Thanks @alinbalutoiu. It looks as though copying and pasting that script stopped at the apt install -y python-pip command. Easy mistake to make, but I should pay more attention. I've finished running it now, and the windows node no longer shows an error when running install_ovn.ps1. This is promising. I'll continue the setup.

greigs avatar Mar 17 '17 14:03 greigs

The node is being seen. I tried deploying the dashboard but it failed due to not matching "linux", which makes sense. So then I tried adding a linux worker node as described. It is shown in the node list correctly.

I then removed and re-added the dashboard deployment. My error is now: Failed to setup network for pod \"kubernetes-dashboard-3203962772-t2382_kube-system(203e8324-0b43-11e7-879b-42010a8e0002)\" using network plugins \"cni\": ; Skipping pod"

Full output attached: describe.txt

Linux node setup (I copied the .pem files from the master node before running): linuxnode.txt

greigs avatar Mar 17 '17 19:03 greigs