helm chart??

Open beyondszine opened this issue 5 years ago • 11 comments

Is there already a Helm chart for Dkron? If not, is anybody working on one?

beyondszine avatar Feb 11 '20 08:02 beyondszine

Nope, at least not publicly. Feel free to write one and submit a pull request if you think it would be useful.

yvanoers avatar Feb 11 '20 21:02 yvanoers

I have one (only for v2, which works with k8s cloud auto-join). I can take some time this evening to open a pull request for it, but since it was built for personal use it's not fully configurable like the "official" Helm charts: only the obvious things are configurable, so some work remains to be done.

jjsaunier avatar Feb 12 '20 12:02 jjsaunier

I sent the pull request: https://github.com/distribworks/dkron/pull/681

jjsaunier avatar Feb 13 '20 21:02 jjsaunier

The readiness/liveness probe definitions here wait for /health to return HTTP 200, but port 8080 is not up before the cluster is formed. Because of that, the pods are not discoverable, which makes the cluster unable to form. The liveness probes also kill pods in a loop for the same reason.

andreygolev avatar Feb 17 '20 11:02 andreygolev

The chart is not perfect. It can depend on the performance of the underlying hardware: if dkron takes too long to boot and exceeds the probe deadline, it enters a vicious circle.

Have you already fixed the issue on your side? I will host the chart on my own, as @victorcoder requested. If you want, submit a PR and help us get a great Helm chart :+1:

jjsaunier avatar Feb 17 '20 14:02 jjsaunier

My implementation attempts to use stable hostnames with a StatefulSet and persistent drives. It still doesn't work well, though, because of something missing in Dkron itself, which I'm also trying to work around/fix :)

When there's something to show, I'll open a PR.

andreygolev avatar Feb 17 '20 14:02 andreygolev

ok!

I'm interested in the StatefulSet approach (I currently use a Deployment), because I tried it and failed: the hostname changes on pod eviction (restart, failure) and then mismatches on reboot in k8s, because the host seems to be persisted in the local DB and not refreshed. So I moved to a Deployment due to the Schrödinger syndrome.

I don't know if it's a bug or something well known. The Deployment also has a drawback: dkron doesn't seem to evict agents (which may have failed and been restarted by k8s), so the list of agents in the "left" state keeps growing. Some GC for the "left" state would be great.

jjsaunier avatar Feb 17 '20 14:02 jjsaunier

There is a GC, actually: a "Reap" interval, which is 24h by default. It removes these "left" and "failed" nodes after some time. You can also send "raft remove-peer" to the Leader node on the pod termination event. That's currently what I'm doing to handle "server" Dkron pod restarts. For non-server agents I just don't care; they'll be reaped later.

andreygolev avatar Feb 17 '20 14:02 andreygolev

OK, interesting. Do you have a code snippet for "raft remove-peer" to share?

I haven't seen problems on pod termination. I guess signals are handled to perform a "graceful shutdown"; at least I've never experienced issues or weird behavior.

jjsaunier avatar Feb 17 '20 15:02 jjsaunier

I have it, but it's still kinda ugly and requires some patches to Dkron which are not there currently :)

The script waits for SIGTERM and, when it catches it, terminates Dkron and then runs "raft remove-peer". Dkron is recompiled with a Reap interval of 1 minute instead of 24h, because we're running in Kubernetes and don't expect a node to come back with the same IP. Also, there's an endpoint, http://dkron-server-api, which is a Kubernetes Service that returns the IP of the Leader pod, because you have to send "remove-peer" to the Leader pod. So, in order to get the Leader pod IP, I had to add a web endpoint to Dkron which returns status 200 for the Leader pod, and use it in the Readiness probe. Kinda that :)

       command:
          - "/bin/sh"
          - "-cx"
          - |
            set -o pipefail

            _term() {
              echo "Caught SIGTERM"
              kill -TERM "$child"
              echo "Waiting for 2 minutes, as it is the approximate time to reap a left node"
              sleep 120
              echo "Removing self from Raft"
              /opt/dkron/bin/dkron raft remove-peer --peer-id="$(hostname)" --rpc-addr="$(wget -q -O - 'http://dkron-server-api:8080/v1/leader?pretty=true' | jq -r .Tags.rpc_addr)"
              echo "Done"
            }

            if [ -z "$POD_IP"  ]; then
              POD_IP=$(hostname -i)
            fi
            FQDN_SUFFIX="${STATEFULSET_NAME}.${STATEFULSET_NAMESPACE}.svc.cluster.local"
            NODE_NAME="$(hostname -s).${FQDN_SUFFIX}"

            JOIN_PEERS=""
            for i in $( seq 0 $((${INITIAL_CLUSTER_SIZE} - 1)) ); do
              JOIN_PEERS="${JOIN_PEERS}${JOIN_PEERS:+ }${STATEFULSET_NAME}-${i}.${FQDN_SUFFIX}"
            done

            # Require multiple loops in the case of unstable DNS resolution
            SUCCESS_LOOPS=5
            while [ "$SUCCESS_LOOPS" -gt 0 ]; do
              ALL_READY=true
              JOIN_LAN=""
              for THIS_PEER in $JOIN_PEERS; do
                  # Make sure we can resolve hostname and ping IP
                  if PEER_IP="$( ( ping -c 1 $THIS_PEER || true ) | awk -F'[()]' '/PING/{print $2}')" && [ "$PEER_IP" != "" ]; then
                    if [ "${PEER_IP}" != "${POD_IP}" ]; then
                      JOIN_LAN="${JOIN_LAN}${JOIN_LAN:+ } --retry-join=$THIS_PEER"
                    fi
                  else
                    ALL_READY=false
                    break
                  fi
              done
              if $ALL_READY; then
                SUCCESS_LOOPS=$(( SUCCESS_LOOPS - 1 ))
                echo "LAN peers appear ready, $SUCCESS_LOOPS verifications left"
              else
                echo "Waiting for LAN peer $THIS_PEER..."
              fi
              sleep 1s
            done

            LEADER=$(wget -q -O - 'http://dkron-server-api:8080/v1/leader?pretty=true' | jq -r .Tags.rpc_addr)

            if [ -z "${LEADER}" ]; then
                BOOTSTRAP_EXPECT="--bootstrap-expect $( echo "$JOIN_PEERS" | wc -w )"
            else
                BOOTSTRAP_EXPECT=""
            fi

            trap _term TERM
            /opt/dkron/bin/dkron agent \
              --server \
              --bind-addr=0.0.0.0 \
              --statsd-addr=127.0.0.1:9125 \
              --log-level=debug \
              --profile=wan \
              ${BOOTSTRAP_EXPECT} \
              ${JOIN_LAN} &

            child=$!
            wait "$child"
            sleep 10
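
For anyone who wants to try the peer-list and bootstrap steps outside a pod, here is a minimal sketch of those two steps as standalone shell functions. The names `build_join_peers` and `bootstrap_flags` and their arguments are hypothetical and not part of Dkron or the chart. Like the script above, the sketch assumes the headless Service has the same name as the StatefulSet and that the cluster domain is cluster.local. It also treats a literal "null" (what `jq -r` prints when the field is missing) as "no leader", which the script above does not do.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; not part of Dkron.

# Print the space-separated DNS names of the first N pods of a StatefulSet.
# $1 = StatefulSet (and headless Service) name, $2 = namespace, $3 = N
build_join_peers() {
  name=$1 namespace=$2 size=$3
  suffix="${name}.${namespace}.svc.cluster.local"
  peers=""
  i=0
  while [ "$i" -lt "$size" ]; do
    # ${peers:+ } adds a separating space only after the first entry
    peers="${peers}${peers:+ }${name}-${i}.${suffix}"
    i=$((i + 1))
  done
  printf '%s\n' "$peers"
}

# Print "--bootstrap-expect N" only when no leader was found.
# $1 = leader address as printed by `jq -r` (empty if the request failed,
#      "null" if the JSON field was absent)
# $2 = peer list as printed by build_join_peers
bootstrap_flags() {
  leader=$1 peers=$2
  if [ -z "$leader" ] || [ "$leader" = "null" ]; then
    # Unquoted on purpose: word splitting counts the peers
    set -- $peers
    printf '%s %s\n' '--bootstrap-expect' "$#"
  fi
}
```

Calling `build_join_peers dkron default 3` prints `dkron-0.dkron.default.svc.cluster.local` through `dkron-2.dkron.default.svc.cluster.local` on one line, and `bootstrap_flags "" "$(build_join_peers dkron default 3)"` prints `--bootstrap-expect 3`. When a leader address is passed, `bootstrap_flags` prints nothing, so the pod joins the existing cluster instead of bootstrapping a new one.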

andreygolev avatar Feb 17 '20 16:02 andreygolev

"Require multiple loops in the case of unstable DNS resolution"

:joy:

Haha, yes, it's a nice script indeed :) Thanks for sharing!

jjsaunier avatar Feb 17 '20 17:02 jjsaunier

Chart merged in https://github.com/distribworks/dkron-helm

vcastellm avatar Mar 05 '23 22:03 vcastellm

Fixed in recent chart and dkron modifications.

vcastellm avatar Feb 11 '24 17:02 vcastellm