
No way to recover cluster without deleting raft data. Older pod IPs are getting used after multiple server failures

Open kumaritanushree opened this issue 3 years ago • 10 comments

Describe the bug: I am running a dkron cluster of 3 nodes. I restarted one pod, and in this case the cluster kept working fine; the new pod (which I had restarted) successfully joined the cluster with its new IP.

Then I restarted 2 nodes, and this time the cluster went down. Even after restarting the pods the cluster cannot be formed again: raft keeps using the old IPs of the deleted pods to connect.

time="2021-07-02T08:47:09Z" level=info msg="2021-07-02T08:47:09.414Z [ERROR] raft: failed to make requestVote RPC: target="{Voter dkron-1 10.48.31.130:6868}" error="dial tcp 10.48.31.130:6868: connect: no route to host"" time="2021-07-02T08:47:09Z" level=info msg="2021/07/02 08:47:09 [DEBUG] memberlist: Stream connection from=10.48.32.45:48900" time="2021-07-02T08:47:09Z" level=info msg="2021-07-02T08:47:09.933Z [INFO] raft: duplicate requestVote for same term: term=131" time="2021-07-02T08:47:10Z" level=info msg="2021-07-02T08:47:10.106Z [ERROR] raft: failed to make requestVote RPC: target="{Voter dkron-1 10.48.31.130:6868}" error="dial tcp 10.48.31.130:6868: i/o timeout"" time="2021-07-02T08:47:10Z" level=info msg="2021-07-02T08:47:10.486Z [DEBUG] raft: lost leadership because received a requestVote with a newer term" time="2021-07-02T08:47:11Z" level=info msg="2021-07-02T08:47:11.032Z [INFO] raft: duplicate requestVote for same term: term=132" time="2021-07-02T08:47:11Z" level=info msg="2021-07-02T08:47:11.567Z [DEBUG] raft: lost leadership because received a requestVote with a newer term" time="2021-07-02T08:47:12Z" level=info msg="2021-07-02T08:47:12.574Z [DEBUG] raft: lost leadership because received a requestVote with a newer term" time="2021-07-02T08:47:13Z" level=info msg="2021-07-02T08:47:13.951Z [DEBUG] raft: lost leadership because received a requestVote with a newer term" time="2021-07-02T08:47:13Z" level=info msg="2021-07-02T08:47:13.953Z [WARN] raft: rejecting vote request since our last term is greater: candidate=10.48.31.131:6868 last-term=89 last-candidate-term=71" time="2021-07-02T08:47:15Z" level=info msg="2021-07-02T08:47:15.059Z [DEBUG] raft: lost leadership because received a requestVote with a newer term" time="2021-07-02T08:47:15Z" level=info msg="2021-07-02T08:47:15.063Z [WARN] raft: rejecting vote request since our last term is greater: candidate=10.48.31.131:6868 last-term=89 last-candidate-term=71" time="2021-07-02T08:47:15Z" level=info msg="2021-07-02T08:47:15.878Z [WARN] raft: heartbeat timeout reached, starting election: last-leader=" time="2021-07-02T08:47:15Z" level=info msg="2021-07-02T08:47:15.878Z [INFO] raft: entering candidate state: node="Node at 10.48.34.150:6868 [Candidate]" term=137" time="2021-07-02T08:47:15Z" level=info msg="2021-07-02T08:47:15.885Z [DEBUG] raft: votes: needed=2" time="2021-07-02T08:47:15Z" level=info msg="2021-07-02T08:47:15.885Z [DEBUG] raft: vote granted: from=dkron-0 term=137 tally=1" time="2021-07-02T08:47:16Z" level=info msg="2021/07/02 08:47:16 [DEBUG] memberlist: Initiating push/pull sync with: dkron-2 10.48.32.45:8946" time="2021-07-02T08:47:16Z" level=info msg="2021-07-02T08:47:16.895Z [INFO] raft: duplicate requestVote for same term: term=137" time="2021-07-02T08:47:16Z" level=info msg="2021-07-02T08:47:16.968Z [WARN] raft: Election timeout reached, restarting election" time="2021-07-02T08:47:16Z" level=info msg="2021-07-02T08:47:16.969Z [INFO] raft: entering candidate state: node="Node at 10.48.34.150:6868 [Candidate]" term=138" time="2021-07-02T08:47:16Z" level=info msg="2021-07-02T08:47:16.975Z [DEBUG] raft: votes: needed=2" time="2021-07-02T08:47:16Z" level=info msg="2021-07-02T08:47:16.976Z [DEBUG] raft: vote granted: from=dkron-0 term=138 tally=1" time="2021-07-02T08:47:17Z" level=info msg="2021-07-02T08:47:17.860Z [INFO] raft: duplicate requestVote for same term: term=138" time="2021-07-02T08:47:18Z" level=info msg="2021-07-02T08:47:18.253Z [WARN] raft: Election timeout reached, restarting election" 
time="2021-07-02T08:47:18Z" level=info msg="2021-07-02T08:47:18.253Z [INFO] raft: entering candidate state: node="Node at 10.48.34.150:6868 [Candidate]" term=139" time="2021-07-02T08:47:18Z" level=info msg="2021-07-02T08:47:18.260Z [DEBUG] raft: votes: needed=2" time="2021-07-02T08:47:18Z" level=info msg="2021-07-02T08:47:18.260Z [DEBUG] raft: vote granted: from=dkron-0 term=139 tally=1" time="2021-07-02T08:47:18Z" level=info msg="2021-07-02T08:47:18.951Z [ERROR] raft: failed to make requestVote RPC: target="{Voter dkron-1 10.48.31.130:6868}" error="dial tcp 10.48.31.130:6868: connect: no route to host"" time="2021-07-02T08:47:18Z" level=info msg="2021-07-02T08:47:18.951Z [ERROR] raft: failed to make requestVote RPC: target="{Voter dkron-1 10.48.31.130:6868}" error="dial tcp 10.48.31.130:6868: connect: no route to host"" time="2021-07-02T08:47:18Z" level=info msg="2021-07-02T08:47:18.951Z [ERROR] raft: failed to make requestVote RPC: target="{Voter dkron-1 10.48.31.130:6868}" error="dial tcp 10.48.31.130:6868: connect: no route to host""

This error keeps looping and the cluster is not able to recover.

The yaml file I am using to deploy the dkron cluster:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: dkron
  labels:
    app: dkron
spec:
  serviceName: dkron
  replicas: 3
  selector:
    matchLabels:
      app: dkron
  template:
    metadata:
      labels:
        app: dkron
    spec:
      containers:
      - name: dkron
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        image: dkron/dkron
        command:
        - "bash"
        - "-c"
        - "/opt/local/dkron/dkron agent
          --server
          --bootstrap-expect=3
          --join=dkron
          --data-dir=/var/lib/dkron/dkron.data
          --log-level=debug
          --tag=dkron_server=true
          --advertise-addr=${HOSTNAME}.dkron.default.svc.cluster.local
          --advertise-rpc-port=6868
          --retry-join=dkron"

Expected behavior: After restarting two out of three nodes, the cluster should recover and start working again without losing any data.

Additional info:

Pod info before deletion of the pods (two servers):

$ kubectl get pods -o wide | grep dkron
dkron-0   1/1   Running   0   3m10s   10.48.34.150   gke-dev-sandbox-nats-pool-4b9e27f9-403h
dkron-1   1/1   Running   0   14s     10.48.31.130   gke-dev-sandbox-nats-pool-5bbad0ed-c635
dkron-2   1/1   Running   0   2m43s   10.48.32.44    gke-dev-sandbox-nats-pool-bf3e5ce4-91eb

Pod info after restarting two pods:

$ kubectl get pods -o wide | grep dkron
dkron-0   1/1   Running   0   7m13s   10.48.34.150   gke-dev-sandbox-nats-pool-4b9e27f9-403h
dkron-1   1/1   Running   0   28s     10.48.31.131   gke-dev-sandbox-nats-pool-5bbad0ed-c635
dkron-2   1/1   Running   0   17s     10.48.32.45    gke-dev-sandbox-nats-pool-bf3e5ce4-91eb

Please help as soon as possible

kumaritanushree · Jul 02 '21 09:07

I have the same or at least a very similar issue: after all dkron nodes/servers were down simultaneously, I'm not able to get the cluster working again - leader election never succeeds.

What I have: I've set up a Kubernetes StatefulSet (tried with 2 and 3 replicas) to initially get dkron up and running. In the first step everything looks good - the member servers join the cluster, a leader is elected and displayed in the web UI. I can create, schedule and run jobs. Here is the StatefulSet's yaml file:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: dkron
  namespace: scheduler
  labels:
    role: scheduler
spec:
  selector:
    matchLabels:
      role: scheduler
  serviceName: dkron-svc
  replicas: 2
  template:
    metadata:
      labels:
        role: scheduler
      namespace: scheduler
    spec:
      terminationGracePeriodSeconds: 20
      serviceAccountName: dkron
      containers:
      - image: dkron/dkron:v3.1.8
        command: ["/bin/sh", "-c"]
        args:
          - echo starting;
            sleep 5;
            /opt/local/dkron/dkron agent --server --advertise-addr=${HOSTNAME}.dkron-svc --advertise-rpc-port=6868;
        name: dkron
        ports:
        - containerPort: 8080
        - containerPort: 6868
        - containerPort: 8946
        volumeMounts:
            - name: dkron-data
              mountPath: /dkron.data
            - name: config
              mountPath: "/etc/dkron/"
              readOnly: true
      volumes:
        - name: config
          configMap:
            name: dkron-configmap
            items:
            - key: dkron.yml
              path: dkron.yml
  volumeClaimTemplates:
    - metadata:
        name: dkron-data
        annotations:
          volume.beta.kubernetes.io/storage-class: "standard"
      spec:
        accessModes: [ "ReadWriteOnce" ]
        resources:
          requests:
            storage: 5Gi

As you can see in the yaml above, I'm providing a dkron.yml file with config parameters:

    server: true
    bootstrap-expect: 2
    data-dir: /dkron.data
    retry-join: ["provider=k8s namespace=scheduler label_selector=\"role=scheduler\""]
    log-level: debug
    raft-multiplier: 1
    encrypt: somekey...
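
As a side note for anyone copying this: the provider=k8s retry-join works by querying the Kubernetes API for pods matching the label selector, so the service account used by the StatefulSet needs permission to list pods. A minimal RBAC sketch, assuming the scheduler namespace and the dkron service account from the manifest above:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: dkron-pod-reader
  namespace: scheduler
rules:
  - apiGroups: [""]          # core API group
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: dkron-pod-reader
  namespace: scheduler
subjects:
  - kind: ServiceAccount
    name: dkron
    namespace: scheduler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: dkron-pod-reader
```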

I also tried to replace the "retry-join" line with the discrete member names - the hostnames can be resolved via DNS:

retry-join:
   - dkron-0.dkron-svc
   - dkron-1.dkron-svc
   - ...

Initially I can get the cluster up and running without any problems. The members join the cluster and elect a leader. I can access the GUI to create, schedule and start jobs there - good!

I can "reboot" single cluster server nodes also - but one all server were down at the same time, I'm not able to get things going again. I was now playing around for hours with all the "retry-join", "bootstrap-expect", advertise-addr, etc options but the outcome is always something like this:

time="2021-07-16T11:11:05Z" level=info msg="2021-07-16T11:11:05.264Z [INFO]  raft: entering candidate state: node=\"Node at 10.43.68.185:6868 [Candidate]\" term=1691"
time="2021-07-16T11:11:05Z" level=info msg="2021-07-16T11:11:05.270Z [DEBUG] raft: votes: needed=2"
time="2021-07-16T11:11:05Z" level=info msg="2021-07-16T11:11:05.271Z [DEBUG] raft: vote granted: from=dkron-1 term=1691 tally=1"
time="2021-07-16T11:11:06Z" level=info msg="2021-07-16T11:11:06.139Z [ERROR] raft: failed to make requestVote RPC: target=\"{Voter dkron-0 10.43.66.225:6868}\" error=\"dial tcp 10.43.66.225:6868: i/o timeout\""
time="2021-07-16T11:11:07Z" level=info msg="2021-07-16T11:11:07.037Z [WARN]  raft: Election timeout reached, restarting election"
time="2021-07-16T11:11:07Z" level=info msg="2021-07-16T11:11:07.037Z [INFO]  raft: entering candidate state: node=\"Node at 10.43.68.185:6868 [Candidate]\" term=1692"
time="2021-07-16T11:11:07Z" level=info msg="2021-07-16T11:11:07.044Z [DEBUG] raft: votes: needed=2"
time="2021-07-16T11:11:07Z" level=info msg="2021-07-16T11:11:07.044Z [DEBUG] raft: vote granted: from=dkron-1 term=1692 tally=1"
time="2021-07-16T11:11:07Z" level=info msg="2021-07-16T11:11:07.487Z [ERROR] raft: failed to make requestVote RPC: target=\"{Voter dkron-0 10.43.66.225:6868}\" error=\"dial tcp 10.43.66.225:6868: i/o timeout\""
time="2021-07-16T11:11:08Z" level=info msg="2021-07-16T11:11:08.629Z [ERROR] raft: failed to make requestVote RPC: target=\"{Voter dkron-0 10.43.66.225:6868}\" error=\"dial tcp 10.43.66.225:6868: i/o timeout\""
time="2021-07-16T11:11:08Z" level=info msg="2021-07-16T11:11:08.781Z [WARN]  raft: Election timeout reached, restarting election"

Every node seems to only be able to "see" itself (10.43.68.185 in this case) and keeps trying to connect to the IPs the members or the last leader had before (e.g. 10.43.66.225).

What am I missing? How can I achieve this kind of disaster recovery? In a Kubernetes environment it is quite common that pods are recreated and therefore change their IPs - since DNS names are static, you usually use those to get things going...

I'm really running out of ideas right now...

Any help or hints are highly appreciated.

rmuehlbauer · Jul 16 '21 11:07

So two things worked for me. I've noticed that if a single pod is restarted it quickly joins the cluster without any trouble, but if the StatefulSet is rolled out the cluster fails. I've added a readiness check:

    readinessProbe:
      tcpSocket:
        port: serf
      initialDelaySeconds: 150

and these flags:

    - --serf-reconnect-timeout=5s

AlonGluz · Sep 01 '21 16:09

Thanks for your suggestions @AlonGluz - in the end, my problem still remains the same. Once all cluster members are down at the same time, I can't get the cluster back up and working again - every cluster member tries to connect to the other cluster members using their (old) IP addresses, but of course this fails because the IP addresses have changed. E.g. the node called "dkron-2" trying to connect to "dkron-1" and "dkron-0":

time="2021-09-15T20:03:33Z" level=info msg="2021-09-15T20:03:33.139Z [DEBUG] raft: votes: needed=2"
time="2021-09-15T20:03:33Z" level=info msg="2021-09-15T20:03:33.139Z [DEBUG] raft: vote granted: from=dkron-2 term=25810 tally=1"
time="2021-09-15T20:03:33Z" level=info msg="2021-09-15T20:03:33.727Z [ERROR] raft: failed to make requestVote RPC: target=\"{Voter dkron-0 10.43.66.84:6868}\" error=\"dial tcp 10.43.66.84:6868: i/o timeout\""
time="2021-09-15T20:03:33Z" level=info msg="2021-09-15T20:03:33.734Z [ERROR] raft: failed to make requestVote RPC: target=\"{Voter dkron-1 10.43.68.188:6868}\" error=\"dial tcp 10.43.68.188:6868: i/o timeout\""
time="2021-09-15T20:03:34Z" level=info msg="2021-09-15T20:03:34.773Z [ERROR] raft: failed to make requestVote RPC: target=\"{Voter dkron-0 10.43.66.84:6868}\" error=\"dial tcp 10.43.66.84:6868: i/o timeout\""
time="2021-09-15T20:03:34Z" level=info msg="2021-09-15T20:03:34.777Z [ERROR] raft: failed to make requestVote RPC: target=\"{Voter dkron-1 10.43.68.188:6868}\" error=\"dial tcp 10.43.68.188:6868: i/o timeout\""
time="2021-09-15T20:03:35Z" level=info msg="2021-09-15T20:03:35.120Z [WARN]  raft: Election timeout reached, restarting election"

I think as long as raft does not use DNS names instead of IP addresses, this problem won't go away. I really wonder how the other folks running dkron on Kubernetes got this working....

rmuehlbauer · Sep 15 '21 20:09

@AlonGluz, just because I'm curious - would you mind sending me (or posting) your whole config? All the dkron start parameters and/or the dkron.yaml file would be relevant and interesting, and please also add the relevant part of your Kubernetes StatefulSet manifest. To be 100% sure - if you also created a separate service account for dkron, please give me some information on that as well (though I don't think this is where the problem comes from, because the cluster members can "find" each other). Would you mind doing that?

rmuehlbauer · Sep 16 '21 07:09

Hi @rmuehlbauer, @kumaritanushree, were you able to set up dkron in K8s as a StatefulSet? I have posted https://github.com/distribworks/dkron/issues/1191 with some questions. I know this is a one-year-old post, but would it be possible for you to share your experience with dkron in prod? Can you share your K8s manifest and dkron.yaml? Thanks for your time on this.

nikunj-badjatya · Oct 13 '22 04:10

@nikunj-badjatya Really sorry, but I have not been working on dkron for a year. I no longer have access to the dkron manifests I was using back then.

kumaritanushree · Oct 13 '22 05:10

This worked for us; basically it delays each pod rollout.

configmap.yaml:

```
apiVersion: v1
kind: ConfigMap
metadata:
  name: dkron-server
  namespace: dkron
data:
  GODEBUG: 'netdns=go'
  DKRON_ENABLE_PROMETHEUS: 'false'
  dkron.yml: |+
    server: true
    bootstrap-expect: 3
    data-dir: /dkron/data
    retry-join: ["provider=k8s namespace=dkron label_selector=\"component=server\""]
    log-level: info
    serf-reconnect-timeout: 5s
    disable-usage-stats: true
```

sts.yaml:

```
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: dkron-server
  namespace: dkron
  labels:
    app: dkron
    component: server
spec:
  replicas: 5
  serviceName: dkron-server
  selector:
    matchLabels:
      app: dkron
      component: server
  template:
    metadata:
      labels: 
        app: dkron
        component: server
      annotations:
        linkerd.io/inject: disabled
    spec:
      serviceAccountName: dkron
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values: 
                - dkron
            topologyKey: kubernetes.io/hostname
      containers:
        - name: dkron-server
          image: dkron/dkron:3.2.1
          envFrom:
            - configMapRef:
                name: dkron-server
          ports:
            - name: http
              containerPort: 8080
            - name: serf
              containerPort: 8946
            - name: grpc
              containerPort: 6868
          resources:
            requests:
              cpu: 250m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 256Mi
          command: 
           - dkron
          args:
            - agent
          volumeMounts: 
            - name: data
              mountPath: /dkron/data
            - name: config
              mountPath: /etc/dkron/
              readOnly: true
          startupProbe:
            tcpSocket:
              port: serf
            initialDelaySeconds: 150
            periodSeconds: 10
          readinessProbe:
            tcpSocket:
              port: serf
            initialDelaySeconds: 150
            periodSeconds: 10
          livenessProbe:
            tcpSocket:
              port: http
            initialDelaySeconds: 5
            periodSeconds: 10
      volumes:
        - name: config
          configMap:
            name: dkron-server
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: ssd
        resources:
          requests:
            storage: 50Gi
```
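
After a rollout you can verify that the servers found each other and elected a leader through the HTTP API; something like this (the pod name is just an example, and the endpoints are the standard dkron v1 REST paths as far as I know):

```sh
# forward the HTTP port of one server pod to localhost
kubectl -n dkron port-forward dkron-server-0 8080:8080 &

# list the serf members this agent sees, then the currently elected leader
curl -s http://localhost:8080/v1/members
curl -s http://localhost:8080/v1/leader
```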

AlonGluz · Oct 13 '22 06:10

@nikunj-badjatya I'm also not using dkron atm - to me it just seemed too unstable and unpredictable for prod usage on k8s. Sorry. Maybe I will try again with the configmap posted by @AlonGluz - I wonder if that can do the trick. As far as I can remember, my problem one year ago was basically that the dkron cluster members always tried to reconnect to restarted pods using their old IPs, not respecting the fact that the IPs had changed and should be re-resolved via DNS first...

rmuehlbauer · Oct 13 '22 07:10

We ran into the problem described by @kumaritanushree. In order to restore the service we also had to delete the Persistent Volumes, because the cluster configuration is stored there. Doing that, obviously, we also lost the schedules. Did anyone find a viable solution to safely run Dkron in production on AKS? I agree that Dkron should save the hostnames of its nodes (which, as StatefulSets, are preserved by the cluster) instead of the pods' IP addresses.
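
A possibly less destructive recovery path is the peers.json mechanism that hashicorp/raft based tools support for outage recovery. I have not verified the exact dkron layout, so treat this as a sketch: with all servers stopped, write a peers.json into the raft subdirectory of each server's data dir (assumed here to be <data-dir>/raft/peers.json) listing the node IDs and their current addresses, then start the pods again. Using the post-restart IPs from the original report as an example:

```json
[
  { "id": "dkron-0", "address": "10.48.34.150:6868", "non_voter": false },
  { "id": "dkron-1", "address": "10.48.31.131:6868", "non_voter": false },
  { "id": "dkron-2", "address": "10.48.32.45:6868",  "non_voter": false }
]
```

On start-up the raft library should rebuild its peer configuration from this file instead of the stale addresses in the existing raft state; if dkron does not pick the file up, this approach does not apply and deleting the raft data remains the only option.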

rkapoz-tech · Jan 26 '23 10:01

@vcastellm any news about this issue? We are still experiencing it - was it fixed in a later version of dkron? We are running version 3.2.1.

omer2500 · Nov 08 '23 15:11