kube-arangodb
ActiveFailover deployment readiness probe error
Transferring conversation over from Slack.
I have an ActiveFailover ArangoDB instance deployed on OpenShift 4.7 with v1.1.9 of the ArangoDB operator. The active and failover servers both exhibit the following error:
Readiness probe failed:
{"level":"info","path":"/secrets/cluster/jwt/token","time":"2021-07-22T13:04:34Z","message":"Try to use file"}
{"level":"info","token":"token","time":"2021-07-22T13:04:34Z","message":"Using JWT Token"}
{"level":"error","error":"Unexpected code: 503 - {\"code\":503,\"error\":true,\"errorMessage\":\"service unavailable\",\"errorNum\":503}","time":"2021-07-22T13:04:34Z","message":"Fatal"}
After some time, the operator removes the pod and deploys a new one, at which point the failover server becomes passive, the new pod becomes the failover, rinse and repeat. The pods running the agents in the deployment do not exhibit this behavior.
The deployment was created with:
apiVersion: "database.arangodb.com/v1"
kind: "ArangoDeployment"
metadata:
  name: "arangodb-production"
spec:
  mode: ActiveFailover
  single:
    volumeClaimTemplate:
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 25Gi
  environment: Production
  image: 'arangodb/arangodb:3.7.11'
  imagePullPolicy: Always
  storageEngine: RocksDB
  networkAttachedVolumes: true
  externalAccess:
    type: NodePort
  disableIPv6: true
  bootstrap:
    passwordSecretNames:
      root: Auto
I'm seeing the same problem on my deployments. @millerjl1701, were you able to resolve this?
When trying to manually run the lifecycle probe from the pod:
$ /lifecycle/tools/arangodb_operator lifecycle probe --endpoint=/_admin/server/availability --ssl --auth
{"level":"info","path":"/secrets/cluster/jwt/token","time":"2021-09-13T00:19:48Z","message":"Try to use file"}
{"level":"info","token":"token","time":"2021-09-13T00:19:48Z","message":"Using JWT Token"}
{"level":"error","error":"Unexpected code: 503 - {\"code\":503,\"error\":true,\"errorMessage\":\"service unavailable\",\"errorNum\":503}","time":"2021-09-13T00:19:48Z","message":"Fatal"}
I also tried setting a manual port-forward to the pod in question and tried to open the UI:
kubectl -n dev port-forward hive-arangodb-sngl-dsu5kvyj-db8238 8529:8529
GET https://localhost:8529/
{
"error": true,
"errorNum": 1496,
"errorMessage": "not a leader",
"code": 503
}
GET https://localhost:8529/_admin/server/availability
{
"code": 503,
"error": true,
"errorMessage": "service unavailable",
"errorNum": 503
}
The readiness probe constantly fails because the API responds with a 503 while the server is not the leader.
When I kill the current leader pod, the active standby starts passing the readiness probe since it is now the leader.
Maybe checking the cluster endpoints would be a better readiness mechanism?
$ /lifecycle/tools/arangodb_operator lifecycle probe --endpoint=/_api/cluster/endpoints --ssl --auth
{"level":"info","path":"/secrets/cluster/jwt/token","time":"2021-09-13T00:42:25Z","message":"Try to use file"}
{"level":"info","token":"token","time":"2021-09-13T00:42:25Z","message":"Using JWT Token"}
{"level":"info","time":"2021-09-13T00:42:25Z","message":"Check passed"}
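For anyone who wants to experiment before an operator-side fix, the shape of that check as a plain Kubernetes exec readiness probe would look roughly like the sketch below. This is a hypothetical pod-spec fragment, not an official operator option; the binary path and flags mirror the lifecycle command above, and the timing values are assumptions:

```yaml
# Hypothetical sketch: an exec readiness probe using the lifecycle binary
# against /_api/cluster/endpoints instead of /_admin/server/availability.
# Timing values below are illustrative assumptions.
readinessProbe:
  exec:
    command:
      - /lifecycle/tools/arangodb_operator
      - lifecycle
      - probe
      - --endpoint=/_api/cluster/endpoints
      - --ssl
      - --auth
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 3
```

Note that the operator generates its pods from its own template, so editing a pod spec by hand will likely be reverted; this only illustrates what the probe would look like.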
I can't speak to the decision to respond with a 503 from the /_admin/server/availability endpoint for an ArangoDB follower server (as documented in https://www.arangodb.com/docs/stable/http/administration-and-monitoring.html#return-whether-or-not-a-server-is-available), but using that endpoint for a K8s liveness/readiness probe seems to go against the intent of running workloads/pods on K8s. Liveness/readiness probes should indicate whether a workload is working as expected. It becomes exceedingly complicated to monitor the health of an entire K8s cluster when pods report "NotReady" purely for their own routing purposes. Additionally, the number of K8s warning events emitted is staggering for an otherwise working-as-intended pod. In this scenario an ArangoDB follower pod is working as expected, so the container should report as "Ready".
Suggestion: label leader/follower pods for ActiveFailover deployments and use Service selectors to manage routing to the leader pod. The HashiCorp Vault project does this by setting a "vault-active" label to "true" or "false", which lets Services use the label to route traffic within K8s.
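As a concrete sketch of that suggestion: assuming the operator maintained a hypothetical label such as `deployment.arangodb.com/active: "true"` on the single-server pods (both the label key and the `arango_deployment` selector below are assumptions for illustration), a Service could route client traffic to the leader only:

```yaml
# Hypothetical sketch: route traffic to the current leader via a label
# selector. The label names/values are assumptions; the operator would
# need to maintain them on the single-server pods at runtime.
apiVersion: v1
kind: Service
metadata:
  name: arangodb-production-leader
spec:
  selector:
    arango_deployment: arangodb-production
    deployment.arangodb.com/active: "true"
  ports:
    - name: server
      port: 8529
      targetPort: 8529
```

With this pattern, followers can stay "Ready" (so K8s health reporting is accurate) while the Service still sends writes only to the leader, exactly as Vault does with its "vault-active" label.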
Hello!
There is a plan to change this part, based on an additional label set on the active server (like you described). In 1.3.x we will focus on the ActiveFailover failover logic, to reduce the failover window as much as possible.
I will link all required PRs to this issue.
Best, Adam.
Having the same issue! One of the single servers never becomes ready because of this. Did anyone solve it? @sarahhenkens?

I'm experiencing the same issue. Any updates on this?
Hello! In 1.3.0 we will change the Service logic to cover this case (an additional label based on the serving condition).
@ajanikow, looking forward as this is the only pod in my cluster that is not "marching in line"!
(combined from similar events): Readiness probe failed:
{"level":"info","path":"/secrets/cluster/jwt/token","time":"2022-04-26T08:40:17Z","message":"Try to use file"}
{"level":"info","token":"token","time":"2022-04-26T08:40:17Z","message":"Using JWT Token"}
{"level":"error","error":"Unexpected code: 503 - {\"code\":503,\"error\":true,\"errorMessage\":\"service unavailable\",\"errorNum\":503}","time":"2022-04-26T08:40:17Z","message":"Fatal"}

Hello guys!
Same problem here in Cluster mode; Single mode is fine.
Does anyone have a solution, or should we wait for version 1.3.0?
Hello!
Already fixed in master (the ActiveFailover node is no longer NotReady; the proper label is added at runtime).
Best, Adam
- it was not affecting Cluster mode; in Cluster mode a pod goes into the NotReady state only if it has no connection or is broken