helm chart??
Is there already a helm chart for dkron? If not, is anybody already working on it?
Nope, at least, not publicly. Feel free to do so and submit a pull request, if you think it is useful.
I have one (only for v2, which works with k8s cloud auto-join). I can take time this evening to open a pull request, but it's not fully configurable like the "official" helm charts, since it was built for personal usage, so some work must be done - only the obvious things are configurable.
I sent the pull request: https://github.com/distribworks/dkron/pull/681
The Readiness/Liveness probe definitions here wait for /health to return HTTP 200, but port 8080 is not up before the cluster is formed. Because of that, pods are not discoverable, making the cluster unable to form itself. For the same reason, the liveness probes kill the pods in a loop.
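For reference, the failure mode described here comes from probes shaped roughly like this (a sketch, not the chart's actual values; only the /health path and port 8080 come from this thread). Since the HTTP API only binds once the cluster has formed, the readiness probe never passes, the pods never become discoverable, and the liveness probe keeps restarting them; raising initialDelaySeconds/failureThreshold only hides the problem on slower hardware.

# Illustrative probe definitions (values are assumptions, not taken from the chart)
livenessProbe:
  httpGet:
    path: /health        # only served once the HTTP API on 8080 is up
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 3    # the pod gets killed before the cluster can form
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5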
The chart is not perfect; it can depend on the underlying hardware performance (if dkron takes too much time to boot and exceeds the deadline) and get stuck in that restart loop.
Have you already fixed the issue on your side? I will host the chart on my own as requested by @victorcoder. If you want, submit a PR and help us have a great helm chart :+1:
My implementation attempts to use stable hostnames with a StatefulSet and persistent volumes. It's still not working well, though, because of something missing in Dkron itself which I'm trying to work around/fix as well :)
When there's something to show, I'll open a PR.
ok!
I'm interested in the StatefulSet approach (I use a Deployment currently), because I tried it and failed: the hostname changes on pod eviction (restart, failure) and then mismatches on reboot in k8s, because the host seems to be persisted in the local DB and not refreshed. So I moved to a Deployment because of that Schrödinger syndrome.
I don't know if it's a bug or something well known. Deployments have a drawback too: dkron doesn't seem to evict agents (which can fail and be restarted by k8s), so the list of agents in the "left" state keeps growing; some GC for the "left" state would be great.
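For illustration, the stable-hostname idea relies on a headless Service plus a StatefulSet, so every pod keeps the same DNS name (dkron-0.dkron.<namespace>.svc.cluster.local) and the same data volume across restarts, which is what keeps the node name persisted in Dkron's local store valid. A minimal sketch, with assumed names, image and paths:

# Headless Service: gives each StatefulSet pod a stable DNS record
apiVersion: v1
kind: Service
metadata:
  name: dkron
spec:
  clusterIP: None
  selector:
    app: dkron
  ports:
  - name: http
    port: 8080
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: dkron
spec:
  serviceName: dkron          # pods resolve as dkron-0.dkron.<ns>.svc.cluster.local
  replicas: 3
  selector:
    matchLabels:
      app: dkron
  template:
    metadata:
      labels:
        app: dkron
    spec:
      containers:
      - name: dkron
        image: dkron/dkron    # assumed image
        ports:
        - containerPort: 8080
        volumeMounts:
        - name: data
          mountPath: /dkron.data   # assumed data directory
  volumeClaimTemplates:            # persistent drive survives pod eviction
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi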
There is a GC actually: there's a "Reap" interval, which is 24h by default. It removes these "left" and "failed" nodes after some time. You can also send "raft remove-peer" to the Leader node on the pod termination event. That's currently what I'm doing to handle "server" Dkron pod restarts. In the case of non-server agents I just don't care; they'll be reaped later.
Ok, interesting. Do you have a code snippet for "raft remove-peer" to share?
I didn't see problems on pod termination; I guess the signals are handled for a "graceful shutdown". At least I never experienced issues or weird behavior.
I have it, but it's kinda still ugly and requires some patches to Dkron which are not there currently :)
The script waits for SIGTERM, and when it catches it, it terminates Dkron and then runs "raft remove-peer". Dkron is recompiled with a Reap interval of 1 minute instead of 24h, because we're running in Kubernetes and don't expect a node to come back with the same IP. There's also an endpoint, http://dkron-server-api, which is a Kubernetes Service that returns the IP of the Leader pod, because you have to send "remove-peer" to the Leader pod. So, in order to get the Leader pod IP, I had to add a web endpoint to Dkron which returns status 200 on the Leader pod, and put it in the Readiness probe. Kinda that :)
command:
- "/bin/sh"
- "-cx"
- |
  set -o pipefail

  # Called on SIGTERM: stop Dkron, wait out the reap window, then remove this
  # node from the Raft peers via the current leader (looked up through the
  # dkron-server-api Service).
  _term() {
    echo "Caught SIGTERM"
    kill -TERM "$child"
    echo "Waiting for 2 minutes, as it is the approximate time to reap the left node"
    sleep 120
    echo "Removing self from Raft"
    /opt/dkron/bin/dkron raft remove-peer --peer-id=$(hostname) --rpc-addr="$(wget -q -O - http://dkron-server-api:8080/v1/leader?pretty=true | jq .Tags.rpc_addr -r)"
    echo "Done"
  }

  if [ -z "$POD_IP" ]; then
    POD_IP=$(hostname -i)
  fi

  # Build the list of expected peers from the StatefulSet's stable DNS names.
  FQDN_SUFFIX="${STATEFULSET_NAME}.${STATEFULSET_NAMESPACE}.svc.cluster.local"
  NODE_NAME="$(hostname -s).${FQDN_SUFFIX}"
  JOIN_PEERS=""
  for i in $( seq 0 $((${INITIAL_CLUSTER_SIZE} - 1)) ); do
    JOIN_PEERS="${JOIN_PEERS}${JOIN_PEERS:+ }${STATEFULSET_NAME}-${i}.${FQDN_SUFFIX}"
  done

  # Require multiple loops in the case of unstable DNS resolution
  SUCCESS_LOOPS=5
  while [ "$SUCCESS_LOOPS" -gt 0 ]; do
    ALL_READY=true
    JOIN_LAN=""
    for THIS_PEER in $JOIN_PEERS; do
      # Make sure we can resolve hostname and ping IP
      if PEER_IP="$( ( ping -c 1 $THIS_PEER || true ) | awk -F'[()]' '/PING/{print $2}')" && [ "$PEER_IP" != "" ]; then
        if [ "${PEER_IP}" != "${POD_IP}" ]; then
          JOIN_LAN="${JOIN_LAN}${JOIN_LAN:+ } --retry-join=$THIS_PEER"
        fi
      else
        ALL_READY=false
        break
      fi
    done
    if $ALL_READY; then
      SUCCESS_LOOPS=$(( SUCCESS_LOOPS - 1 ))
      echo "LAN peers appear ready, $SUCCESS_LOOPS verifications left"
    else
      echo "Waiting for LAN peer $THIS_PEER..."
    fi
    sleep 1s
  done

  # Only pass --bootstrap-expect when there is no leader yet (initial bootstrap).
  LEADER=$(wget -q -O - http://dkron-server-api:8080/v1/leader?pretty=true | jq .Tags.rpc_addr -r)
  if [ -z "${LEADER}" ]; then
    BOOTSTRAP_EXPECT="--bootstrap-expect $( echo "$JOIN_PEERS" | wc -w )"
  else
    BOOTSTRAP_EXPECT=""
  fi

  trap _term SIGTERM
  /opt/dkron/bin/dkron agent \
    --server \
    --bind-addr=0.0.0.0 \
    --statsd-addr=127.0.0.1:9125 \
    --log-level=debug \
    --profile=wan \
    ${BOOTSTRAP_EXPECT} \
    ${JOIN_LAN} &
  child=$!
  wait "$child"
  sleep 10
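For completeness, the dkron-server-api endpoint queried above is just a regular Service over the server pods; because the patched readiness probe returns 200 only on the current leader, the leader is the only ready endpoint, so the Service effectively always resolves to it. A sketch with assumed labels (only the Service name is taken from the script):

# Service that ends up pointing at the leader pod only, thanks to the
# leader-only readiness probe described above
apiVersion: v1
kind: Service
metadata:
  name: dkron-server-api
spec:
  selector:
    app: dkron           # assumed labels on the server pods
    role: server
  ports:
  - name: http
    port: 8080
    targetPort: 8080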
Require multiple loops in the case of unstable DNS resolution
:joy:
haha, yes, it's a nice script indeed :) thanks for sharing!
Chart merged in https://github.com/distribworks/dkron-helm
Fixed in recent chart and dkron modifications.