graphql-engine
graphql-engine copied to clipboard
/healthz route shouldn't return 200 if there are inconsistencies
Is your proposal related to a problem?
We deploy Hasura in AWS cloud using the ECS Fargate service. The usual pattern for deploying fargate services is to have a load balancer in front of Fargate that tests containers health and sends them traffic if they're considered healthy.
The issue with Hasura's /healthz
route is that it returns 200 even if there are inconsistencies in the schema. (see https://hasura.io/docs/latest/graphql/core/api-reference/health.html#response)
In our case, we use remote schemas and there may be inconsistencies during the deployment of a new version of Hasura + the remote schema, it takes some time for Hasura to acknowledge the new schema. But as the /healthz
route is responding 200 to the load balancer, Hasura is already accepting traffic that it can't handle properly.
Describe the solution you'd like
At the very least, I'd like to be able to customize the code returned by Hasura when it's not totally healthy (like using another 2XX code) so the load balancer only sends traffic to Hasura if everything is 100% ok. It could be an environment variable that overwrite the default 200 code.
Describe alternatives you've considered
Right now we would have to set up a whole proxy in front of Hasura just to rewrite its /healthz route.
If the feature is approved, would you be willing to submit a PR?
I'm not super familiar with Haskell but I think it wouldn't be a big change : https://github.com/hasura/graphql-engine/blob/3773ba98b0edbe6888bab6b32f9802f329705d9f/server/src-lib/Hasura/Server/App.hs#L961-L969
Hi @jgoux, thank you for opening this issue.
In general, Hasura is considered healthy even if there are inconsistencies in the schema, because it can continue to serve for the parts that are consistent.
However we understand that in some scenarios, one may not want Hasura to serve queries in case of inconsistent schema. We would approach this requirement, by providing a configuration option, say "inconsistentSchemaIsHealthy" (true by default). If this is set to false, then the /healthz
will throw 500 in case of inconsistent schema.
We have added the requested feature in our backlog. We will update when we are able to prioritise this. In the meanwhile please let us know if above approach works for your requirement.
We would approach this requirement, by providing a configuration option, say "inconsistentSchemaIsHealthy" (true by default). If this is set to false, then the /healthz will throw 500 in case of inconsistent schema.
Although, I kinda agree with having this configuration option but I think this is something that users may shoot themselves in the foot with (so maybe the concern I am raising below is a documentation issue).
Suppose a user sets this option in their /healthz endpoint and one table/relationship is having some inconsistency then all the hasura instances are now deemed unavailable and it can't even be fixed naturally over time (i.e. complete downtime).
I think concern here is with inconsistencies that are common during startup:
In our case, we use remote schemas and there may be inconsistencies during the deployment of a new version of Hasura + the remote schema, it takes some time for Hasura to acknowledge the new schema.
Maybe the right configuration option here is something like initialDelaySeconds
that Kubernetes also exposes: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-readiness-probes
If healthz is probed within the intialDelaySeconds
of startup then it returns 500 Unavailable or something.
But...should Hasura API be giving this option or is this better handled by the container orchestrator?