
ActiveFailover deployment readiness probe error

Open millerjl1701 opened this issue 4 years ago • 9 comments

Transferring conversation over from Slack.

I have an ActiveFailover ArangoDB instance deployed on OpenShift 4.7 with v1.1.9 of the ArangoDB operator. The active and failover servers both exhibit the following error:

Readiness probe failed:

{"level":"info","path":"/secrets/cluster/jwt/token","time":"2021-07-22T13:04:34Z","message":"Try to use file"}
{"level":"info","token":"token","time":"2021-07-22T13:04:34Z","message":"Using JWT Token"}
{"level":"error","error":"Unexpected code: 503 - {\"code\":503,\"error\":true,\"errorMessage\":\"service unavailable\",\"errorNum\":503}","time":"2021-07-22T13:04:34Z","message":"Fatal"}

After some time, the operator removes the pod and deploys a new one, at which point the failover server becomes the active one and the new pod becomes the failover; rinse and repeat. The pods running the agents in the deployment do not exhibit this behavior.

The deployment was created with:

apiVersion: "database.arangodb.com/v1"
kind: "ArangoDeployment"
metadata:
  name: "arangodb-production"
spec:
  mode: ActiveFailover
  single:
    volumeClaimTemplate:
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 25Gi
  environment: Production
  image: 'arangodb/arangodb:3.7.11'
  imagePullPolicy: Always
  storageEngine: RocksDB
  networkAttachedVolumes: true
  externalAccess:
    type: NodePort
  disableIPv6: true
  bootstrap:
    passwordSecretNames:
      root: Auto

millerjl1701 avatar Jul 23 '21 19:07 millerjl1701

I'm seeing the same problem on my deployments. @millerjl1701, were you able to resolve this?

sarahhenkens avatar Sep 13 '21 00:09 sarahhenkens

When trying to manually run the lifecycle probe from the pod:

$ /lifecycle/tools/arangodb_operator lifecycle probe --endpoint=/_admin/server/availability --ssl --auth

{"level":"info","path":"/secrets/cluster/jwt/token","time":"2021-09-13T00:19:48Z","message":"Try to use file"}
{"level":"info","token":"token","time":"2021-09-13T00:19:48Z","message":"Using JWT Token"}
{"level":"error","error":"Unexpected code: 503 - {\"code\":503,\"error\":true,\"errorMessage\":\"service unavailable\",\"errorNum\":503}","time":"2021-09-13T00:19:48Z","message":"Fatal"}

I also tried setting a manual port-forward to the pod in question and tried to open the UI:

kubectl -n dev port-forward hive-arangodb-sngl-dsu5kvyj-db8238 8529:8529

GET https://localhost:8529/

{
"error": true,
"errorNum": 1496,
"errorMessage": "not a leader",
"code": 503
}

GET https://localhost:8529/_admin/server/availability

{
"code": 503,
"error": true,
"errorMessage": "service unavailable",
"errorNum": 503
}

The readiness probe is constantly failing because the availability endpoint returns a 503 whenever the server is not the leader.

When I kill the current leader pod, the active standby starts passing the readiness probe, since it's now the leader.

sarahhenkens avatar Sep 13 '21 00:09 sarahhenkens

Maybe checking for the cluster endpoints is a better mechanism for readiness?

$ /lifecycle/tools/arangodb_operator lifecycle probe --endpoint=/_api/cluster/endpoints --ssl --auth
{"level":"info","path":"/secrets/cluster/jwt/token","time":"2021-09-13T00:42:25Z","message":"Try to use file"}
{"level":"info","token":"token","time":"2021-09-13T00:42:25Z","message":"Using JWT Token"}
{"level":"info","time":"2021-09-13T00:42:25Z","message":"Check passed"}
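
If the operator exposed a way to override the generated probe, a pod-spec fragment using this check might look like the sketch below. This is purely hypothetical placement: the operator generates the pod spec itself, and whether it supports a custom probe endpoint would need to be confirmed against its CRD; the exec command simply mirrors the manual invocation above.

```yaml
# Hypothetical readiness probe using /_api/cluster/endpoints instead of
# /_admin/server/availability, so a follower still reports Ready.
readinessProbe:
  exec:
    command:
      - /lifecycle/tools/arangodb_operator
      - lifecycle
      - probe
      - --endpoint=/_api/cluster/endpoints
      - --ssl
      - --auth
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 3
```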

sarahhenkens avatar Sep 13 '21 00:09 sarahhenkens

I can't speak to the decision to respond with a 503 from the /_admin/server/availability endpoint for an ArangoDB follower server (as documented in https://www.arangodb.com/docs/stable/http/administration-and-monitoring.html#return-whether-or-not-a-server-is-available), but using that endpoint for a K8s readiness probe seems to go against the intent of running workloads/pods on K8s. Liveness/readiness probes should indicate whether a workload is working as expected. It becomes exceedingly complicated to monitor the health of an entire K8s cluster when Pods report "NotReady" purely for their own routing purposes. Additionally, the number of K8s warning Events emitted is staggering for an otherwise working-as-intended Pod. In this scenario the ArangoDB follower Pod is working as expected, so the container should report as "Ready".

Suggestion: Label leader / follower Pods for ActiveFailover Deployments and use Service selectors to manage routing to the leader Pod. The Hashicorp Vault project does this by setting the "vault-active" label to "true" or "false" which allows Services to use the label for routing traffic within K8s.
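
As a sketch of that suggestion (the label key and Service name here are hypothetical, not something the operator sets today), a dedicated Service could route client traffic to the current leader only:

```yaml
# Hypothetical: assumes the operator maintains an "arango-leader" label on the
# current leader Pod, analogous to Vault's "vault-active" label.
apiVersion: v1
kind: Service
metadata:
  name: arangodb-production-leader
spec:
  selector:
    arango_deployment: arangodb-production
    arango-leader: "true"   # hypothetical label, flipped by the operator on failover
  ports:
    - name: server
      port: 8529
      targetPort: 8529
```

On failover, the operator would only need to move the label; the Service's endpoints would update automatically, with no dependency on readiness state for routing.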

kjvellajr avatar Sep 13 '21 18:09 kjvellajr

Hello!

There is a plan to change this part, based on an additional label set on the active server (like you described). In 1.3.x we will focus on the ActiveFailover failover logic, to reduce the failover window as much as possible.

I will link all required PRs to this issue.

Best, Adam.

ajanikow avatar Sep 14 '21 06:09 ajanikow

I have the same issue! One of the single servers never becomes ready because of this. Did anyone solve it? @sarahhenkens?

mysticaltech avatar Jan 12 '22 08:01 mysticaltech

I'm experiencing the same issue. Any updates on this?

espizo avatar Feb 14 '22 10:02 espizo

Hello! In 1.3.0 we will change the logic of services to cover this case (additional label based on serving condition).

ajanikow avatar Mar 24 '22 23:03 ajanikow

@ajanikow, looking forward as this is the only pod in my cluster that is not "marching in line"!

(combined from similar events) Readiness probe failed:

{"level":"info","path":"/secrets/cluster/jwt/token","time":"2022-04-26T08:40:17Z","message":"Try to use file"}
{"level":"info","token":"token","time":"2022-04-26T08:40:17Z","message":"Using JWT Token"}
{"level":"error","error":"Unexpected code: 503 - {\"code\":503,\"error\":true,\"errorMessage\":\"service unavailable\",\"errorNum\":503}","time":"2022-04-26T08:40:17Z","message":"Fatal"}

mysticaltech avatar Apr 26 '22 08:04 mysticaltech

Hello guys!

Same problem here in Cluster mode; Single mode is fine.

Does anyone have any solution? Or should we wait for version 1.3.0?

nuvme-devops avatar Oct 13 '22 13:10 nuvme-devops

Hello!

This is already fixed in master (the ActiveFailover node no longer goes NotReady, and the proper label is added at runtime).

Best, Adam

ajanikow avatar Feb 21 '23 18:02 ajanikow

  • It was not affecting Cluster mode; in Cluster mode a pod goes into the NotReady state only if it has lost its connection or is broken.

ajanikow avatar Feb 21 '23 18:02 ajanikow