
Job for etcd.service failed because the control process exited with error code

Open · rcbandit111 opened this issue 2 years ago · 4 comments

Environment:

  • Oracle cloud VM

  • OS (printf "$(uname -srm)\n$(cat /etc/os-release)\n"): Linux 5.15.0-1040-oracle aarch64 PRETTY_NAME="Ubuntu 22.04.3 LTS" NAME="Ubuntu" VERSION_ID="22.04" VERSION="22.04.3 LTS (Jammy Jellyfish)" VERSION_CODENAME=jammy ID=ubuntu ID_LIKE=debian HOME_URL="https://www.ubuntu.com/" SUPPORT_URL="https://help.ubuntu.com/" BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/" PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy" UBUNTU_CODENAME=jammy

  • Version of Ansible (ansible --version): ansible [core 2.14.10] config file = /root/kubespray/ansible.cfg configured module search path = ['/root/kubespray/library'] ansible python module location = /usr/local/lib/python3.10/dist-packages/ansible ansible collection location = /root/.ansible/collections:/usr/share/ansible/collections executable location = /usr/local/bin/ansible python version = 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (/usr/bin/python3) jinja version = 3.1.2 libyaml = True

  • Version of Python (python --version): Python 3.10.12

Kubespray version (commit) (git rev-parse --short HEAD): 0f243d751

Network plugin used: N/A

Full inventory with variables (ansible -i inventory/sample/inventory.ini all -m debug -a "var=hostvars[inventory_hostname]"):

Command used to invoke ansible: ansible-playbook -i inventory/mycluster/hosts.yaml --become --become-user=root cluster.yml

Output of ansible run:

I want to install Kubernetes on Oracle Cloud VMs using Kubespray. I tried these steps:

apt install python3-pip

git clone https://github.com/kubernetes-sigs/kubespray
cd kubespray
git checkout master  # or check out a release tag if you want a specific version

sudo pip3 install -r requirements.txt
cp -rfp inventory/sample inventory/mycluster

declare -a IPS=(192.168.1.24 192.168.1.25 192.168.1.26)
CONFIG_FILE=inventory/mycluster/hosts.yaml python3 contrib/inventory_builder/inventory.py ${IPS[@]}
    
vi inventory/mycluster/hosts.yaml

ansible-playbook -i inventory/mycluster/hosts.yaml  --become --become-user=root cluster.yml
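(As a quick sanity check before running cluster.yml, something like the command below should confirm that Ansible can reach and escalate privileges on every node in the same inventory; this is a minimal sketch using the inventory path from the steps above.)

# Hedged sanity check: SSH connectivity and privilege escalation for all hosts
ansible -i inventory/mycluster/hosts.yaml all -m ping --become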

On Oracle Cloud every VM has a private IP (for example 10.0.0.x) and a public IP (for example 141.147.3.x). I get this error during the Kubespray run:

fatal: [node1]: FAILED! => {"attempts": 4, "changed": false, "cmd": "set -o pipefail && /usr/local/bin/etcdctl endpoint --cluster status && /usr/local/bin/etcdctl endpoint --cluster health 2>&1 | grep -v 'Error: unhealthy cluster' >/dev/null", "delta": "0:00:05.018287", "end": "2023-09-17 16:26:01.614612", "msg": "non-zero return code", "rc": 1, "start": "2023-09-17 16:25:56.596325", "stderr": "{\"level\":\"warn\",\"ts\":\"2023-09-17T16:26:01.61284Z\",\"logger\":\"etcd-client\",\"caller\":\"[email protected]/retry_interceptor.go:62\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"etcd-endpoints://0x40002c0fc0/10.0.0.77:2379\",\"attempt\":0,\"error\":\"rpc error: code = DeadlineExceeded desc = context deadline exceeded\"}\nError: failed to fetch endpoints from etcd cluster member list: context deadline exceeded", "stderr_lines": ["{\"level\":\"warn\",\"ts\":\"2023-09-17T16:26:01.61284Z\",\"logger\":\"etcd-client\",\"caller\":\"[email protected]/retry_interceptor.go:62\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"etcd-endpoints://0x40002c0fc0/10.0.0.77:2379\",\"attempt\":0,\"error\":\"rpc error: code = DeadlineExceeded desc = context deadline exceeded\"}", "Error: failed to fetch endpoints from etcd cluster member list: context deadline exceeded"], "stdout": "", "stdout_lines": []}
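(For reference, a hedged way to run the same health check by hand on an etcd node is sketched below. The certificate paths assume Kubespray's default /etc/ssl/etcd/ssl layout and the hostname node1, so adjust them to whatever actually exists on disk; the endpoint 10.0.0.77 is the private IP from the error above.)

# Query the etcd member directly; a failure here points at the etcd service
# itself or at firewall rules on ports 2379/2380, not at Ansible.
ETCDCTL_API=3 /usr/local/bin/etcdctl \
  --endpoints=https://10.0.0.77:2379 \
  --cacert=/etc/ssl/etcd/ssl/ca.pem \
  --cert=/etc/ssl/etcd/ssl/admin-node1.pem \
  --key=/etc/ssl/etcd/ssl/admin-node1-key.pem \
  endpoint health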

In the cluster inventory file I have this configuration:

all:
  hosts:
    node1:
      ansible_host: 192.168.1.24
      ip: 192.168.1.24
      access_ip: 192.168.1.24
    node2:
      ansible_host: 192.168.1.25
      ip: 192.168.1.25
      access_ip: 192.168.1.25
    node3:
      ansible_host: 192.168.1.26
      ip: 192.168.1.26
      access_ip: 192.168.1.26
  children:
    kube_control_plane:
      hosts:
        node1:
    kube_node:
      hosts:
        node2:
        node3:
    etcd:
      hosts:
        node1:
        node2:
        node3:
    k8s_cluster:
      children:
        kube_control_plane:
        kube_node:
    calico_rr:
      hosts: {}
  • These are just example IPs

Do you know how I can solve this issue?

PS:

Sep 17 16:00:58 node1 etcd[11821]: {"level":"info","ts":"2023-09-17T16:00:58.752307Z","caller":"embed/etcd.go:378","msg":"closed etcd server","name":"etcd1","data-dir":"/var/lib/etcd","advertise-peer-urls":["https://141.147.3.x:2380"],"advertise-client-urls":["https://141.147.3.x:2379"]}
Sep 17 16:00:58 node1 etcd[11821]: {"level":"fatal","ts":"2023-09-17T16:00:58.752382Z","caller":"etcdmain/etcd.go:204","msg":"discovery failed","error":"--initial-cluster has etcd1=https://10.0.0.x:2380 but missing from --initial-advertise-peer-urls=https://141.147.3.x:2380 (resolved urls: \"https://141.147.3.x:2380\" != \"https://10.0.0.x:2380\")","stacktrace":"go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\tgo.etcd.io/etcd/server/v3/etcdmain/etcd.go:204\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\tgo.etcd.io/etcd/server/v3/etcdmain/main.go:40\nmain.main\n\tgo.etcd.io/etcd/server/v3/main.go:31\nruntime.main\n\truntime/proc.go:250"}
Sep 17 16:00:58 node1 systemd[1]: etcd.service: Main process exited, code=exited, status=1/FAILURE
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ An ExecStart= process belonging to unit etcd.service has exited.
░░
░░ The process' exit code is 'exited' and its exit status is 1.
Sep 17 16:00:58 node1 systemd[1]: etcd.service: Failed with result 'exit-code'.
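(The fatal line above shows --initial-cluster carrying the private 10.0.0.x address while --initial-advertise-peer-urls carries the public 141.147.3.x one. A hedged way to confirm which addresses Kubespray actually rendered is to inspect the etcd environment file it generates; the path /etc/etcd.env is an assumption based on Kubespray's host-deployed etcd and may differ in your setup.)

# Compare the addresses etcd complains about; in a healthy setup this node's
# entry in ETCD_INITIAL_CLUSTER matches ETCD_INITIAL_ADVERTISE_PEER_URLS.
grep -E 'ETCD_INITIAL_CLUSTER=|ETCD_INITIAL_ADVERTISE_PEER_URLS=|ETCD_LISTEN_PEER_URLS=' /etc/etcd.env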

rcbandit111 · Sep 17 '23 16:09

@rcbandit111 Is it really correct that you use the external IP in all cases (ansible_host, ip, access_ip)? I guess ip, at least, should be the internal IP address.
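(To illustrate that suggestion, a sketch of a single node entry split that way is shown below; the 141.147.3.x / 10.0.0.x values are placeholders taken from this issue, not a tested configuration.)

node1:
  ansible_host: 141.147.3.x   # public IP, used only by Ansible over SSH
  ip: 10.0.0.x                # private IP that etcd/kubelet should bind and advertise
  access_ip: 10.0.0.x         # private IP other nodes use to reach this node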

ihippik · Sep 29 '23 09:09

Hmm, maybe the private and public IPs are getting mixed up somehow via Ansible facts? :thinking: The IPs in your inventory do not match those in the error message. Could you provide the exact task where this happens? At least include the relevant portion of the Ansible output (including the task name).
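(One hedged way to check for that kind of confusion is to compare the address Ansible gathers as a fact with the ip declared in the inventory, for example:)

# Show the default IPv4 address Ansible detects, then the inventory's ip var.
ansible -i inventory/mycluster/hosts.yaml node1 -m setup -a "filter=ansible_default_ipv4"
ansible -i inventory/mycluster/hosts.yaml node1 -m debug -a "var=hostvars[inventory_hostname].ip"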

VannTen · Jan 22 '24 13:01

/triage needs-information

VannTen · Jan 30 '24 09:01

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · Apr 29 '24 10:04

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot · May 29 '24 10:05

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot · Jun 28 '24 10:06

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot · Jun 28 '24 10:06