graphql-engine icon indicating copy to clipboard operation
graphql-engine copied to clipboard

/healthz route shouldn't return 200 if there are inconsistencies

Open jgoux opened this issue 2 years ago • 2 comments

Is your proposal related to a problem?

We deploy Hasura in AWS cloud using the ECS Fargate service. The usual pattern for deploying fargate services is to have a load balancer in front of Fargate that tests containers health and sends them traffic if they're considered healthy.

The issue with Hasura's /healthz route is that it returns 200 even if there are inconsistencies in the schema. (see https://hasura.io/docs/latest/graphql/core/api-reference/health.html#response)

In our case, we use remote schemas and there may be inconsistencies during the deployment of a new version of Hasura + the remote schema, it takes some time for Hasura to acknowledge the new schema. But as the /healthz route is responding 200 to the load balancer, Hasura is already accepting traffic that it can't handle properly.

Describe the solution you'd like

At the very least, I'd like to be able to customize the code returned by Hasura when it's not totally healthy (like using another 2XX code) so the load balancer only sends traffic to Hasura if everything is 100% ok. It could be an environment variable that overwrite the default 200 code.

Describe alternatives you've considered

Right now we would have to set up a whole proxy in front of Hasura just to rewrite its /healthz route.

If the feature is approved, would you be willing to submit a PR?

I'm not super familiar with Haskell but I think it wouldn't be a big change : https://github.com/hasura/graphql-engine/blob/3773ba98b0edbe6888bab6b32f9802f329705d9f/server/src-lib/Hasura/Server/App.hs#L961-L969

jgoux avatar Dec 10 '21 18:12 jgoux

Hi @jgoux, thank you for opening this issue.

In general, Hasura is considered healthy even if there are inconsistencies in the schema, because it can continue to serve for the parts that are consistent.

However we understand that in some scenarios, one may not want Hasura to serve queries in case of inconsistent schema. We would approach this requirement, by providing a configuration option, say "inconsistentSchemaIsHealthy" (true by default). If this is set to false, then the /healthz will throw 500 in case of inconsistent schema.

We have added the requested feature in our backlog. We will update when we are able to prioritise this. In the meanwhile please let us know if above approach works for your requirement.

manasag avatar Sep 01 '22 12:09 manasag

We would approach this requirement, by providing a configuration option, say "inconsistentSchemaIsHealthy" (true by default). If this is set to false, then the /healthz will throw 500 in case of inconsistent schema.

Although, I kinda agree with having this configuration option but I think this is something that users may shoot themselves in the foot with (so maybe the concern I am raising below is a documentation issue).

Suppose a user sets this option in their /healthz endpoint and one table/relationship is having some inconsistency then all the hasura instances are now deemed unavailable and it can't even be fixed naturally over time (i.e. complete downtime).

I think concern here is with inconsistencies that are common during startup:

In our case, we use remote schemas and there may be inconsistencies during the deployment of a new version of Hasura + the remote schema, it takes some time for Hasura to acknowledge the new schema.

Maybe the right configuration option here is something like initialDelaySeconds that Kubernetes also exposes: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-readiness-probes

If healthz is probed within the intialDelaySeconds of startup then it returns 500 Unavailable or something.

But...should Hasura API be giving this option or is this better handled by the container orchestrator?

tirumaraiselvan avatar Sep 07 '22 07:09 tirumaraiselvan