Job for etcd.service failed because the control process exited with error code
Environment:
- Cloud provider or hardware configuration: Oracle Cloud VM
- OS (printf "$(uname -srm)\n$(cat /etc/os-release)\n"): Linux 5.15.0-1040-oracle aarch64 PRETTY_NAME="Ubuntu 22.04.3 LTS" NAME="Ubuntu" VERSION_ID="22.04" VERSION="22.04.3 LTS (Jammy Jellyfish)" VERSION_CODENAME=jammy ID=ubuntu ID_LIKE=debian UBUNTU_CODENAME=jammy
- Version of Ansible (ansible --version): ansible [core 2.14.10] config file = /root/kubespray/ansible.cfg configured module search path = ['/root/kubespray/library'] ansible python module location = /usr/local/lib/python3.10/dist-packages/ansible ansible collection location = /root/.ansible/collections:/usr/share/ansible/collections executable location = /usr/local/bin/ansible python version = 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (/usr/bin/python3) jinja version = 3.1.2 libyaml = True
- Version of Python (python --version): Python 3.10.12
- Kubespray version (commit) (git rev-parse --short HEAD): 0f243d751
- Network plugin used: N/A
- Full inventory with variables (ansible -i inventory/sample/inventory.ini all -m debug -a "var=hostvars[inventory_hostname]"):
- Command used to invoke ansible: ansible-playbook -i inventory/mycluster/hosts.yaml --become --become-user=root cluster.yml
- Output of ansible run:
I want to install Kubernetes on Oracle Cloud VMs using Kubespray. I tried these steps:
apt install python3-pip
git clone https://github.com/kubernetes-sigs/kubespray
cd kubespray
git checkout master # or check out a release tag if you want a different version
sudo pip3 install -r requirements.txt
cp -rfp inventory/sample inventory/mycluster
declare -a IPS=(192.168.1.24 192.168.1.25 192.168.1.26)
CONFIG_FILE=inventory/mycluster/hosts.yaml python3 contrib/inventory_builder/inventory.py ${IPS[@]}
vi inventory/mycluster/hosts.yaml
ansible-playbook -i inventory/mycluster/hosts.yaml --become --become-user=root cluster.yml
On Oracle Cloud, every VM has a private IP (for example 10.0.0.x) and a public IP (141.147.3.x). I get this error during the Kubespray run:
fatal: [node1]: FAILED! => {"attempts": 4, "changed": false, "cmd": "set -o pipefail && /usr/local/bin/etcdctl endpoint --cluster status && /usr/local/bin/etcdctl endpoint --cluster health 2>&1 | grep -v 'Error: unhealthy cluster' >/dev/null", "delta": "0:00:05.018287", "end": "2023-09-17 16:26:01.614612", "msg": "non-zero return code", "rc": 1, "start": "2023-09-17 16:25:56.596325", "stderr": "{\"level\":\"warn\",\"ts\":\"2023-09-17T16:26:01.61284Z\",\"logger\":\"etcd-client\",\"caller\":\"[email protected]/retry_interceptor.go:62\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"etcd-endpoints://0x40002c0fc0/10.0.0.77:2379\",\"attempt\":0,\"error\":\"rpc error: code = DeadlineExceeded desc = context deadline exceeded\"}\nError: failed to fetch endpoints from etcd cluster member list: context deadline exceeded", "stderr_lines": ["{\"level\":\"warn\",\"ts\":\"2023-09-17T16:26:01.61284Z\",\"logger\":\"etcd-client\",\"caller\":\"[email protected]/retry_interceptor.go:62\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"etcd-endpoints://0x40002c0fc0/10.0.0.77:2379\",\"attempt\":0,\"error\":\"rpc error: code = DeadlineExceeded desc = context deadline exceeded\"}", "Error: failed to fetch endpoints from etcd cluster member list: context deadline exceeded"], "stdout": "", "stdout_lines": []}
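For reference, the failing check can be rerun by hand on node1 to reproduce this outside of Ansible. The etcdctl path comes from the error above; the certificate locations are an assumption based on Kubespray's default etcd cert directory (/etc/ssl/etcd/ssl) and may differ on your cluster:

# Rough manual re-run of the health check Kubespray performs (paths assumed):
export ETCDCTL_API=3
export ETCDCTL_ENDPOINTS=https://10.0.0.x:2379
export ETCDCTL_CACERT=/etc/ssl/etcd/ssl/ca.pem
export ETCDCTL_CERT=/etc/ssl/etcd/ssl/admin-node1.pem
export ETCDCTL_KEY=/etc/ssl/etcd/ssl/admin-node1-key.pem
/usr/local/bin/etcdctl endpoint --cluster status
/usr/local/bin/etcdctl endpoint --cluster health

If etcd itself is crash-looping (as the PS below shows), these commands will also time out; the useful signal is in the etcd service log, not in the client.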
In the inventory file I have this configuration:
all:
  hosts:
    node1:
      ansible_host: 192.168.1.24
      ip: 192.168.1.24
      access_ip: 192.168.1.24
    node2:
      ansible_host: 192.168.1.25
      ip: 192.168.1.25
      access_ip: 192.168.1.25
    node3:
      ansible_host: 192.168.1.26
      ip: 192.168.1.26
      access_ip: 192.168.1.26
  children:
    kube_control_plane:
      hosts:
        node1:
    kube_node:
      hosts:
        node2:
        node3:
    etcd:
      hosts:
        node1:
        node2:
        node3:
    k8s_cluster:
      children:
        kube_control_plane:
        kube_node:
    calico_rr:
      hosts: {}
(These are just example IPs.)
Do you know how I can solve this issue?
PS: here is the etcd service log from node1:
Sep 17 16:00:58 node1 etcd[11821]: {"level":"info","ts":"2023-09-17T16:00:58.752307Z","caller":"embed/etcd.go:378","msg":"closed etcd server","name":"etcd1","data-dir":"/var/lib/etcd","advertise-peer-urls":["https://141.147.3.x:2380"],"advertise-client-urls":["https://141.147.3.x:2379"]}
Sep 17 16:00:58 node1 etcd[11821]: {"level":"fatal","ts":"2023-09-17T16:00:58.752382Z","caller":"etcdmain/etcd.go:204","msg":"discovery failed","error":"--initial-cluster has etcd1=https://10.0.0.x:2380 but missing from --initial-advertise-peer-urls=https://141.147.3.x:2380 (resolved urls: \"https://141.147.3.x:2380\" != \"https://10.0.0.x:2380\")","stacktrace":"go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\tgo.etcd.io/etcd/server/v3/etcdmain/etcd.go:204\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\tgo.etcd.io/etcd/server/v3/etcdmain/main.go:40\nmain.main\n\tgo.etcd.io/etcd/server/v3/main.go:31\nruntime.main\n\truntime/proc.go:250"}
Sep 17 16:00:58 node1 systemd[1]: etcd.service: Main process exited, code=exited, status=1/FAILURE
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ An ExecStart= process belonging to unit etcd.service has exited.
░░
░░ The process' exit code is 'exited' and its exit status is 1.
Sep 17 16:00:58 node1 systemd[1]: etcd.service: Failed with result 'exit-code'.
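The fatal line above is the actual failure: etcd's --initial-cluster lists the member at its private IP (10.0.0.x) while --initial-advertise-peer-urls carries the public IP (141.147.3.x), and etcd refuses to start when those disagree. One rough way to confirm which addresses Kubespray rendered is to inspect the generated etcd environment file on a node; the /etc/etcd.env path below is Kubespray's default for host-deployed etcd and is an assumption about this setup:

# On an etcd node: compare the peer URLs Kubespray rendered for etcd.
grep -E 'ETCD_INITIAL_CLUSTER=|ETCD_INITIAL_ADVERTISE_PEER_URLS|ETCD_LISTEN_PEER_URLS' /etc/etcd.env
# Each member's entry in ETCD_INITIAL_CLUSTER must match the URL that member
# advertises via ETCD_INITIAL_ADVERTISE_PEER_URLS (private vs public IP).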
@rcbandit111 Is it really correct that you use the external IP in all cases (ansible_host, ip, access_ip)?
ip, I would guess, should be the internal IP address.
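If that is the case, a minimal sketch of what a per-host entry could look like on Oracle Cloud (addresses are placeholders; whether access_ip should be the private or the public IP depends on how the nodes reach each other):

all:
  hosts:
    node1:
      ansible_host: 141.147.3.x   # public IP: how Ansible reaches the node over SSH
      ip: 10.0.0.x                # private IP: what etcd/kubelet bind and advertise
      access_ip: 10.0.0.x         # private IP: what other cluster nodes connect to

With ip and access_ip both on the private network, etcd would advertise the same 10.0.0.x peer URL it lists in --initial-cluster, avoiding the mismatch shown in the log above.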
Hmm, maybe there is some confusion between the private and public IPs somehow with the Ansible facts? :thinking: The IPs in your inventory do not match those in the error message; could you provide the exact task where this happens? At least include the relevant portion of the Ansible output (including the task name).
/triage needs-information
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
In response to this:
    The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
    This bot triages issues according to the following rules:
    - After 90d of inactivity, lifecycle/stale is applied
    - After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
    - After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
    You can:
    - Reopen this issue with /reopen
    - Mark this issue as fresh with /remove-lifecycle rotten
    - Offer to help out with Issue Triage
    Please send feedback to sig-contributor-experience at kubernetes/community.
    /close not-planned
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.