No way to recover cluster without deleting raft data. Older pod IPs are still used after multiple server failures
Describe the bug
I am running a dkron cluster of 3 nodes. I restarted one pod and in that case the cluster kept working fine: the new pod (the one I restarted) successfully rejoined the cluster with its new IP.
Then I restarted 2 nodes and this time the cluster went down; even after the pods restarted, the cluster is not able to form again. Raft keeps using the old IP of the deleted pod to connect.
time="2021-07-02T08:47:09Z" level=info msg="2021-07-02T08:47:09.414Z [ERROR] raft: failed to make requestVote RPC: target="{Voter dkron-1 10.48.31.130:6868}" error="dial tcp 10.48.31.130:6868: connect: no route to host"" time="2021-07-02T08:47:09Z" level=info msg="2021/07/02 08:47:09 [DEBUG] memberlist: Stream connection from=10.48.32.45:48900" time="2021-07-02T08:47:09Z" level=info msg="2021-07-02T08:47:09.933Z [INFO] raft: duplicate requestVote for same term: term=131" time="2021-07-02T08:47:10Z" level=info msg="2021-07-02T08:47:10.106Z [ERROR] raft: failed to make requestVote RPC: target="{Voter dkron-1 10.48.31.130:6868}" error="dial tcp 10.48.31.130:6868: i/o timeout"" time="2021-07-02T08:47:10Z" level=info msg="2021-07-02T08:47:10.486Z [DEBUG] raft: lost leadership because received a requestVote with a newer term" time="2021-07-02T08:47:11Z" level=info msg="2021-07-02T08:47:11.032Z [INFO] raft: duplicate requestVote for same term: term=132" time="2021-07-02T08:47:11Z" level=info msg="2021-07-02T08:47:11.567Z [DEBUG] raft: lost leadership because received a requestVote with a newer term" time="2021-07-02T08:47:12Z" level=info msg="2021-07-02T08:47:12.574Z [DEBUG] raft: lost leadership because received a requestVote with a newer term" time="2021-07-02T08:47:13Z" level=info msg="2021-07-02T08:47:13.951Z [DEBUG] raft: lost leadership because received a requestVote with a newer term" time="2021-07-02T08:47:13Z" level=info msg="2021-07-02T08:47:13.953Z [WARN] raft: rejecting vote request since our last term is greater: candidate=10.48.31.131:6868 last-term=89 last-candidate-term=71" time="2021-07-02T08:47:15Z" level=info msg="2021-07-02T08:47:15.059Z [DEBUG] raft: lost leadership because received a requestVote with a newer term" time="2021-07-02T08:47:15Z" level=info msg="2021-07-02T08:47:15.063Z [WARN] raft: rejecting vote request since our last term is greater: candidate=10.48.31.131:6868 last-term=89 last-candidate-term=71" time="2021-07-02T08:47:15Z" level=info msg="2021-07-02T08:47:15.878Z [WARN] raft: heartbeat timeout reached, starting election: last-leader=" time="2021-07-02T08:47:15Z" level=info msg="2021-07-02T08:47:15.878Z [INFO] raft: entering candidate state: node="Node at 10.48.34.150:6868 [Candidate]" term=137" time="2021-07-02T08:47:15Z" level=info msg="2021-07-02T08:47:15.885Z [DEBUG] raft: votes: needed=2" time="2021-07-02T08:47:15Z" level=info msg="2021-07-02T08:47:15.885Z [DEBUG] raft: vote granted: from=dkron-0 term=137 tally=1" time="2021-07-02T08:47:16Z" level=info msg="2021/07/02 08:47:16 [DEBUG] memberlist: Initiating push/pull sync with: dkron-2 10.48.32.45:8946" time="2021-07-02T08:47:16Z" level=info msg="2021-07-02T08:47:16.895Z [INFO] raft: duplicate requestVote for same term: term=137" time="2021-07-02T08:47:16Z" level=info msg="2021-07-02T08:47:16.968Z [WARN] raft: Election timeout reached, restarting election" time="2021-07-02T08:47:16Z" level=info msg="2021-07-02T08:47:16.969Z [INFO] raft: entering candidate state: node="Node at 10.48.34.150:6868 [Candidate]" term=138" time="2021-07-02T08:47:16Z" level=info msg="2021-07-02T08:47:16.975Z [DEBUG] raft: votes: needed=2" time="2021-07-02T08:47:16Z" level=info msg="2021-07-02T08:47:16.976Z [DEBUG] raft: vote granted: from=dkron-0 term=138 tally=1" time="2021-07-02T08:47:17Z" level=info msg="2021-07-02T08:47:17.860Z [INFO] raft: duplicate requestVote for same term: term=138" time="2021-07-02T08:47:18Z" level=info msg="2021-07-02T08:47:18.253Z [WARN] raft: Election timeout reached, restarting election" 
time="2021-07-02T08:47:18Z" level=info msg="2021-07-02T08:47:18.253Z [INFO] raft: entering candidate state: node="Node at 10.48.34.150:6868 [Candidate]" term=139" time="2021-07-02T08:47:18Z" level=info msg="2021-07-02T08:47:18.260Z [DEBUG] raft: votes: needed=2" time="2021-07-02T08:47:18Z" level=info msg="2021-07-02T08:47:18.260Z [DEBUG] raft: vote granted: from=dkron-0 term=139 tally=1" time="2021-07-02T08:47:18Z" level=info msg="2021-07-02T08:47:18.951Z [ERROR] raft: failed to make requestVote RPC: target="{Voter dkron-1 10.48.31.130:6868}" error="dial tcp 10.48.31.130:6868: connect: no route to host"" time="2021-07-02T08:47:18Z" level=info msg="2021-07-02T08:47:18.951Z [ERROR] raft: failed to make requestVote RPC: target="{Voter dkron-1 10.48.31.130:6868}" error="dial tcp 10.48.31.130:6868: connect: no route to host"" time="2021-07-02T08:47:18Z" level=info msg="2021-07-02T08:47:18.951Z [ERROR] raft: failed to make requestVote RPC: target="{Voter dkron-1 10.48.31.130:6868}" error="dial tcp 10.48.31.130:6868: connect: no route to host""
This error repeats in a loop and the cluster is not able to recover.
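For anyone hitting this loop, one thing that may be worth trying before wiping the raft data is inspecting and pruning the stale peer entries. This is only a hedged sketch: it assumes a dkron 3.x build that ships the `dkron raft` subcommands, it only helps while at least one server can still reach the others, and the exact flag names should be checked against `dkron raft remove-peer --help`:

```sh
# Show the raft peer set as dkron sees it (look for voters that still
# carry old pod IPs). "dkron-0" is just an example pod name.
kubectl exec -it dkron-0 -- dkron raft list-peers

# Attempt to drop a stale voter. The --peer-id flag name is an assumption;
# verify it with `dkron raft remove-peer --help` for your dkron version.
kubectl exec -it dkron-0 -- dkron raft remove-peer --peer-id=dkron-1
```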
The YAML file I am using to deploy the dkron cluster:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: dkron
  labels:
    app: dkron
spec:
  serviceName: dkron
  replicas: 3
  selector:
    matchLabels:
      app: dkron
  template:
    metadata:
      labels:
        app: dkron
    spec:
      containers:
      - name: dkron
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        image: dkron/dkron
        command:
        - "bash"
        - "-c"
        - "/opt/local/dkron/dkron
           agent
           --server
           --bootstrap-expect=3
           --join=dkron
           --data-dir=/var/lib/dkron/dkron.data
           --log-level=debug
           --tag=dkron_server=true
           --advertise-addr=${HOSTNAME}.dkron.default.svc.cluster.local
           --advertise-rpc-port=6868
           --retry-join=dkron"
Expected behavior
After restarting two out of three nodes, the cluster should recover and start working again without any data loss.
Additional info
Pod info before deleting the pods (two servers):
$ kubectl get pods -o wide | grep dkron
dkron-0 1/1 Running 0 3m10s 10.48.34.150 gke-dev-sandbox-nats-pool-4b9e27f9-403h
Pod info after restarting two pods:
$ kubectl get pods -o wide | grep dkron
dkron-0 1/1 Running 0 7m13s 10.48.34.150 gke-dev-sandbox-nats-pool-4b9e27f9-403h
Please help as soon as possible
I have the same, or at least a very similar, issue: after all dkron nodes/servers were down simultaneously I'm not able to get the cluster working again; leader election never succeeds.
What I have
I've set up a Kubernetes StatefulSet (tried with 2 and 3 replicas) to initially get dkron up and running. In this first step everything looks good: the member servers join the cluster, a leader is elected and displayed on the WebUI. I can create, schedule and run jobs. Here is the StatefulSet's YAML file:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: dkron
  namespace: scheduler
  labels:
    role: scheduler
spec:
  selector:
    matchLabels:
      role: scheduler
  serviceName: dkron-svc
  replicas: 2
  template:
    metadata:
      labels:
        role: scheduler
        namespace: scheduler
    spec:
      terminationGracePeriodSeconds: 20
      serviceAccountName: dkron
      containers:
      - image: dkron/dkron:v3.1.8
        command: ["/bin/sh", "-c"]
        args:
        - echo starting;
          sleep 5;
          /opt/local/dkron/dkron agent --server --advertise-addr=${HOSTNAME}.dkron-svc --advertise-rpc-port=6868;
        name: dkron
        ports:
        - containerPort: 8080
        - containerPort: 6868
        - containerPort: 8946
        volumeMounts:
        - name: dkron-data
          mountPath: /dkron.data
        - name: config
          mountPath: "/etc/dkron/"
          readOnly: true
      volumes:
      - name: config
        configMap:
          name: dkron-configmap
          items:
          - key: dkron.yml
            path: dkron.yml
  volumeClaimTemplates:
  - metadata:
      name: dkron-data
      annotations:
        volume.beta.kubernetes.io/storage-class: "standard"
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 5Gi
As you can see in the YAML above, I'm providing a dkron.yml file with config parameters:
server: true
bootstrap-expect: 2
data-dir: /dkron.data
retry-join: ["provider=k8s namespace=scheduler label_selector=\"role=scheduler\""]
log-level: debug
raft-multiplier: 1
encrypt: somekey...
I also tried to replace the "retry-join" line with the discrete member names (the hostnames can be resolved via DNS):
retry-join:
- dkron-0.dkron-svc
- dkron-1.dkron-svc
- ...
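A side note on the `provider=k8s` retry-join: it discovers peers by listing pods through the Kubernetes API, so the `dkron` ServiceAccount referenced in the StatefulSet needs read access to pods. This is a hedged sketch of the RBAC such a setup assumes; names and namespace are taken from the manifests above, not from the original comment:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: dkron-pod-reader
  namespace: scheduler
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: dkron-pod-reader
  namespace: scheduler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: dkron-pod-reader
subjects:
- kind: ServiceAccount
  name: dkron
  namespace: scheduler
```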
Initially I can get the cluster up and running without any problems. The members join the cluster and elect a leader. I can access the GUI to create, schedule and start jobs there - good!
I can also "reboot" single cluster server nodes - but once all servers were down at the same time, I'm not able to get things going again. I have now been playing around for hours with all the "retry-join", "bootstrap-expect", advertise-addr, etc. options, but the outcome is always something like this:
time="2021-07-16T11:11:05Z" level=info msg="2021-07-16T11:11:05.264Z [INFO] raft: entering candidate state: node=\"Node at 10.43.68.185:6868 [Candidate]\" term=1691"
time="2021-07-16T11:11:05Z" level=info msg="2021-07-16T11:11:05.270Z [DEBUG] raft: votes: needed=2"
time="2021-07-16T11:11:05Z" level=info msg="2021-07-16T11:11:05.271Z [DEBUG] raft: vote granted: from=dkron-1 term=1691 tally=1"
time="2021-07-16T11:11:06Z" level=info msg="2021-07-16T11:11:06.139Z [ERROR] raft: failed to make requestVote RPC: target=\"{Voter dkron-0 10.43.66.225:6868}\" error=\"dial tcp 10.43.66.225:6868: i/o timeout\""
time="2021-07-16T11:11:07Z" level=info msg="2021-07-16T11:11:07.037Z [WARN] raft: Election timeout reached, restarting election"
time="2021-07-16T11:11:07Z" level=info msg="2021-07-16T11:11:07.037Z [INFO] raft: entering candidate state: node=\"Node at 10.43.68.185:6868 [Candidate]\" term=1692"
time="2021-07-16T11:11:07Z" level=info msg="2021-07-16T11:11:07.044Z [DEBUG] raft: votes: needed=2"
time="2021-07-16T11:11:07Z" level=info msg="2021-07-16T11:11:07.044Z [DEBUG] raft: vote granted: from=dkron-1 term=1692 tally=1"
time="2021-07-16T11:11:07Z" level=info msg="2021-07-16T11:11:07.487Z [ERROR] raft: failed to make requestVote RPC: target=\"{Voter dkron-0 10.43.66.225:6868}\" error=\"dial tcp 10.43.66.225:6868: i/o timeout\""
time="2021-07-16T11:11:08Z" level=info msg="2021-07-16T11:11:08.629Z [ERROR] raft: failed to make requestVote RPC: target=\"{Voter dkron-0 10.43.66.225:6868}\" error=\"dial tcp 10.43.66.225:6868: i/o timeout\""
time="2021-07-16T11:11:08Z" level=info msg="2021-07-16T11:11:08.781Z [WARN] raft: Election timeout reached, restarting election"
Every node seems to be able to only "see" itself (10.43.68.185 in this case) and keeps trying to connect to the IPs the other members or the last leader had before (e.g. 10.43.66.225).
What am I missing? How can I achieve this kind of disaster recovery? In a Kubernetes environment it is quite common that pods are recreated and therefore change their IPs - as DNS names are static, you usually use those to get things going...
I'm really running out of ideas right now...
Any help or hints are highly appreciated.
So two things worked for me. I've noticed that if a single pod is restarted, it quickly rejoins the cluster without any trouble, but if the StatefulSet is rolled out, the cluster fails.
I've added a readiness check:
readinessProbe:
  tcpSocket:
    port: serf
  initialDelaySeconds: 150
and these flags:
- --serf-reconnect-timeout=5s
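For context, this is roughly where those two pieces land in the StatefulSet container spec. A hedged sketch, assuming the image and names from the manifests in this thread; note that `port: serf` in the probe only works if the serf containerPort is named:

```yaml
containers:
- name: dkron
  image: dkron/dkron:v3.1.8
  command: ["dkron"]
  args:
  - agent
  - --server
  - --serf-reconnect-timeout=5s
  ports:
  - name: serf                 # named so the probe can reference it
    containerPort: 8946
  readinessProbe:
    tcpSocket:
      port: serf
    initialDelaySeconds: 150
```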
Thanks for your suggestions @AlonGluz - in the end, my problem still remains the same. Once all cluster members are down at the same time, I can't get the cluster back up and working again: every cluster member tries to connect to the other cluster members using their (old) IP addresses, which of course fails because the IP addresses have changed. E.g. the node called "dkron-2" trying to connect to "dkron-1" and "dkron-0":
time="2021-09-15T20:03:33Z" level=info msg="2021-09-15T20:03:33.139Z [DEBUG] raft: votes: needed=2"
time="2021-09-15T20:03:33Z" level=info msg="2021-09-15T20:03:33.139Z [DEBUG] raft: vote granted: from=dkron-2 term=25810 tally=1"
time="2021-09-15T20:03:33Z" level=info msg="2021-09-15T20:03:33.727Z [ERROR] raft: failed to make requestVote RPC: target=\"{Voter dkron-0 10.43.66.84:6868}\" error=\"dial tcp 10.43.66.84:6868: i/o timeout\""
time="2021-09-15T20:03:33Z" level=info msg="2021-09-15T20:03:33.734Z [ERROR] raft: failed to make requestVote RPC: target=\"{Voter dkron-1 10.43.68.188:6868}\" error=\"dial tcp 10.43.68.188:6868: i/o timeout\""
time="2021-09-15T20:03:34Z" level=info msg="2021-09-15T20:03:34.773Z [ERROR] raft: failed to make requestVote RPC: target=\"{Voter dkron-0 10.43.66.84:6868}\" error=\"dial tcp 10.43.66.84:6868: i/o timeout\""
time="2021-09-15T20:03:34Z" level=info msg="2021-09-15T20:03:34.777Z [ERROR] raft: failed to make requestVote RPC: target=\"{Voter dkron-1 10.43.68.188:6868}\" error=\"dial tcp 10.43.68.188:6868: i/o timeout\""
time="2021-09-15T20:03:35Z" level=info msg="2021-09-15T20:03:35.120Z [WARN] raft: Election timeout reached, restarting election"
I think as long as raft does not use DNS names instead of IP addresses this problem won't go away. I really wonder how the other folks running dkron on Kubernetes got this working...
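One experiment along those lines (a hedged sketch, not a confirmed fix): advertise the pod's stable StatefulSet DNS name instead of its IP, so the address dkron hands to raft at least survives pod recreation. The first manifest in this thread already does this and raft still logs raw IPs as voter addresses, so whether the name is kept or resolved to an IP internally depends on the dkron version; verify with `dkron raft list-peers` after a restart.

```yaml
# Assumed names: "dkron-svc" headless Service and "scheduler" namespace,
# matching the manifest earlier in this thread.
command: ["/bin/sh", "-c"]
args:
- exec /opt/local/dkron/dkron agent --server
  --advertise-addr=$(hostname).dkron-svc.scheduler.svc.cluster.local
  --advertise-rpc-port=6868
```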
@AlonGluz, just because I'm curious - would you mind sending me (or posting) your whole config? Relevant and interesting are all dkron start parameters and/or the dkron.yaml file. Please also add the relevant part of your Kubernetes StatefulSet manifest. And to be 100% sure - if you also created a separate ServiceAccount for dkron, please include that information too (though I don't think that's where the problem comes from, because the cluster members can "find" each other). Would you mind doing that?
Hi @rmuehlbauer, @kumaritanushree, were you able to set up dkron in K8s as a StatefulSet? I have posted https://github.com/distribworks/dkron/issues/1191 with some questions. I know this is a one-year-old post, but would it be possible for you to share your experience with dkron in prod? Can you share your K8s manifest and dkron.yaml? Thanks for your time on this.
@nikunj-badjatya Really sorry, but I haven't worked on dkron for a year. I no longer have access to the dkron manifests I was using back then.
This worked for us; basically it delays each pod rollout.

configmap.yaml

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: dkron-server
  namespace: dkron
data:
  GODEBUG: 'netdns=go'
  DKRON_ENABLE_PROMETHEUS: 'false'
  dkron.yml: |+
    server: true
    bootstrap-expect: 3
    data-dir: /dkron/data
    retry-join: ["provider=k8s namespace=dkron label_selector=\"component=server\""]
    log-level: info
    serf-reconnect-timeout: 5s
    disable-usage-stats: true
```
sts.yaml

```yaml
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: dkron-server
  namespace: dkron
  labels:
    app: dkron
    component: server
spec:
  replicas: 5
  serviceName: dkron-server
  selector:
    matchLabels:
      app: dkron
      component: server
  template:
    metadata:
      labels:
        app: dkron
        component: server
      annotations:
        linkerd.io/inject: disabled
    spec:
      serviceAccountName: dkron
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - dkron
            topologyKey: kubernetes.io/hostname
      containers:
      - name: dkron-server
        image: dkron/dkron:3.2.1
        envFrom:
        - configMapRef:
            name: dkron-server
        ports:
        - name: http
          containerPort: 8080
        - name: serf
          containerPort: 8946
        - name: grpc
          containerPort: 6868
        resources:
          requests:
            cpu: 250m
            memory: 256Mi
          limits:
            cpu: 500m
            memory: 256Mi
        command:
        - dkron
        args:
        - agent
        volumeMounts:
        - name: data
          mountPath: /dkron/data
        - name: config
          mountPath: /etc/dkron/
          readOnly: true
        startupProbe:
          tcpSocket:
            port: serf
          initialDelaySeconds: 150
          periodSeconds: 10
        readinessProbe:
          tcpSocket:
            port: serf
          initialDelaySeconds: 150
          periodSeconds: 10
        livenessProbe:
          tcpSocket:
            port: http
          initialDelaySeconds: 5
          periodSeconds: 10
      volumes:
      - name: config
        configMap:
          name: dkron-server
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: ssd
      resources:
        requests:
          storage: 50Gi
```
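For completeness: this manifest references `serviceName: dkron-server` and `serviceAccountName: dkron`, neither of which is included in the comment. A hedged sketch of the headless Service it presumably pairs with (the ServiceAccount plus the pod-list RBAC sketched earlier in this thread would also be needed for the `provider=k8s` retry-join):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: dkron-server
  namespace: dkron
spec:
  clusterIP: None                  # headless, gives each pod a stable DNS record
  publishNotReadyAddresses: true   # lets peers resolve each other before the 150s readiness delay passes
  selector:
    app: dkron
    component: server
  ports:
  - name: http
    port: 8080
  - name: serf
    port: 8946
  - name: grpc
    port: 6868
```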
@nikunj-badjatya I'm also not using dkron at the moment - to me it just seemed too unstable and unpredictable for prod usage on K8s. Sorry. Maybe I will try again with the ConfigMap posted by @AlonGluz - I wonder if that can do the trick. As far as I can remember, my problem one year ago was basically that the dkron cluster members always tried to reconnect to restarted pods using their old IPs, not respecting the fact that the IPs had changed and should be re-resolved via DNS first...
We ran into the problem described by @kumaritanushree. In order to restore the service we also had to delete the Persistent Volumes, because the cluster configuration is stored there. Doing that, obviously, we also lost the schedules. Did anyone find a viable solution to safely run dkron in production on AKS? I agree that dkron should save the hostnames of its nodes (which, as StatefulSet pods, are preserved by the cluster) instead of the pods' IP addresses.
@vcastellm any news about this issue? We are still experiencing it. Was it fixed in a later version of dkron? We are running version 3.2.1.