airbyte-platform

feat(kubernetes): define separate readiness and liveness endpoints for k8s probes

Open justenwalker opened this issue 2 years ago • 16 comments

What

Liveness and readiness checks should not check downstream dependencies, because momentary hiccups can take the whole service out and prolong downtime.


How

Add separate webapp and server API paths to be used specifically for liveness/readiness probes, and update the helm chart to use them instead.

The key features are:

  1. They have no dependencies on downstream services - they are completely independent.
  2. They are separate - readiness and liveness are distinct endpoints, allowing logic to be added in the future if they need to behave differently.

In a more robust system, the readiness probe should fail once the application receives a SIGTERM, while the liveness check remains OK. This takes the pod out of the Service while allowing connections to drain and in-flight work to complete as the application shuts itself down. Currently this is not implemented.
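
The SIGTERM behavior described above can be sketched as a small state machine. This is a framework-free illustration, not code from this PR; the class and method names are hypothetical:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of the probe state machine described above:
// liveness is always OK while the process runs; readiness flips to
// false on SIGTERM so the pod is removed from the Service while
// in-flight work drains.
class ProbeState {
    private final AtomicBoolean ready = new AtomicBoolean(false);

    // Called once the server has finished starting up.
    void onStartup() {
        ready.set(true);
    }

    // Called when a SIGTERM / shutdown event is received.
    void onShutdownSignal() {
        ready.set(false); // stop receiving new traffic...
        // ...but liveness stays OK so the orchestrator doesn't kill us
        // while existing connections drain.
    }

    boolean isReady() {
        return ready.get();
    }

    boolean isLive() {
        return true; // alive as long as the process can answer at all
    }
}
```

The key property is that the two signals diverge only during shutdown: readiness goes false while liveness stays true, which removes the pod from the load balancer without triggering a restart.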

airbyte-webapp

  • add /.well-known/readyz endpoint that returns 200 OK
  • use this endpoint in webapp helm chart in place of /api/v1/health (which checks api + db)

airbyte-server

  • add /health/readiness and /health/liveness endpoints that return 200 OK (via Micronaut's HealthIndicators / Endpoint)
  • use these endpoints in server helm chart
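
The PR implements these endpoints via Micronaut's health machinery; as a framework-free illustration of what dependency-free probe endpoints look like (using only the JDK's built-in `com.sun.net.httpserver`, with a hypothetical `ProbeEndpoints` class), something like:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;

// Framework-free sketch of dependency-free probe endpoints (the real PR
// uses Micronaut HealthIndicators). Each handler answers 200 OK without
// touching the database or any other downstream service.
class ProbeEndpoints {
    static HttpServer start(int port) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        for (String path : new String[] {"/health/liveness", "/health/readiness"}) {
            server.createContext(path, exchange -> {
                byte[] body = "OK".getBytes();
                exchange.sendResponseHeaders(200, body.length);
                try (OutputStream os = exchange.getResponseBody()) {
                    os.write(body);
                }
            });
        }
        server.start();
        return server;
    }
}
```

Because the handlers hold no references to the database, Temporal, or any other service, a dependency outage cannot make these probes fail.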

Recommended reading order

  1. airbyte-webapp/nginx/default.conf.template
  2. charts/airbyte-webapp/templates/deployment.yaml
  3. airbyte-server/src/main/resources/application.yml
  4. airbyte-server/src/main/java/io/airbyte/server/LivenessIndicator.java
  5. airbyte-server/src/main/java/io/airbyte/server/ReadinessIndicator.java
  6. charts/airbyte-server/templates/deployment.yaml

Can this PR be safely reverted / rolled back?

  • [x] YES 💚

🚨 User Impact 🚨

This may change behavior for users who were previously relying on the health checks hitting the database and triggering a restart/outage of the server + webapp.

justenwalker avatar Apr 07 '23 02:04 justenwalker

I have this running in our Dev environment. The endpoints return 200 responses immediately. I didn't use the chart, but I did modify the deployment to use the new endpoints, and they are working as expected.

justenwalker avatar Apr 19 '23 15:04 justenwalker

@justenwalker Micronaut actually supports the liveness and readiness checks via a HealthIndicator implementation (see the section on Customization). This is the preferred way to implement the checks for k8s, instead of custom controllers.

jdpgrailsdev avatar May 03 '23 15:05 jdpgrailsdev

@jdpgrailsdev Great; I'll take a look at it.

I mostly program in Go at $dayjob, so I've been out of the loop on new developments in Java and the various frameworks for quite some time. Thanks for bringing it to my attention.

justenwalker avatar May 03 '23 15:05 justenwalker

@jdpgrailsdev - I've tried to update the server to use health indicators. I think this is pretty straight-forward, but another pair of eyes on it would be helpful if you have the time. Thanks!

justenwalker avatar May 03 '23 16:05 justenwalker

Added missing Javadoc and fixed some style/formatting issues.

justenwalker avatar May 03 '23 19:05 justenwalker

@justenwalker Thanks for converting to the Micronaut health indicators. Would you be willing to add unit tests for the indicators? I know that the implementations are currently trivial, but it would give us a base to work off of if/when the check logic needs to become more complicated. You could either use regular JUnit5 tests or write it as a Micronaut Test, which is essentially a JUnit5 test with the Micronaut context so that it also tests bean creation/dependency injection. Thanks!

jdpgrailsdev avatar May 04 '23 16:05 jdpgrailsdev

@justenwalker can you tell me more about what you are trying to solve here?

From Airbyte's pov, readiness and liveness checks are basically equivalent, since the server doesn't have a notion of queues and cannot serve traffic if it's up and not connected to the database.

davinchia avatar May 04 '23 16:05 davinchia

@justenwalker can you tell me more about what you are trying to solve here?

From Airbyte's pov, readiness and liveness checks are basically equivalent, since the server doesn't have a notion of queues and cannot serve traffic if it's up and not connected to the database.

That may be true, but it's better not to restart the application endlessly when the database is down - which is what will happen if you tie liveness checks to db health (or any upstream/dependency signal, for that matter).

In a cluster of more than 1 server (for HA/redundancy), it is unlikely that a dependency would be unreachable/down for one server but reachable for the others; so it doesn't make sense to remove a server from the load balancer or restart it on a dependency failure, because you'll end up removing/restarting them all at once. It's better to keep them up and wait for the dependency to come back.

Health probes should indicate whether the component itself is healthy, not whether the system as a whole is healthy. The aggregate health signal is useful for alerting, but these particular probes have specific meaning inside Kubernetes.

  • Health (in general):
    • Description: The service is operational - it has a valid db connection, is alive and functioning, and everything is working as expected.
    • This signal is useful for alerting, but not very useful as a Kubernetes probe. It has unintended side effects, like shutting the server down when one of its dependencies is acting up - making the downtime worse and causing a cascade of failures and thundering-herd reconnections.
    • This signal should not be used for either liveness or readiness checks.
  • Readiness:
    • Description: The server is listening on the desired port, is accepting connections, and is not in the process of shutting down.
    • Kubernetes: If the server is not ready, it will be taken out of the load balancer, preventing new connections from arriving. This is useful for more graceful handling of a rolling restart or an upgrade, so API connections don't get dropped. With help from the application, if it receives a SIGTERM it SHOULD fail its readiness checks and use its remaining grace period to handle existing connections before it terminates or is killed by the orchestrator.
  • Liveness:
    • Description: The server is "alive", i.e. not dead/live-locked - not stuck in a bad program state where connections are accepted but nothing happens to them.
    • Kubernetes: If the server is not live, it will be restarted by Kubernetes. The liveness probe should remain healthy while the readiness probe is unhealthy during a shutdown sequence, so that the pod isn't killed by the orchestrator. Failures here are more of a last resort: if your program can detect that it's in a bad state internally, it should just crash rather than relying on the probe.

In summary:

If health checks are consolidated into one endpoint:

  • You cannot support graceful shutdown of network requests in this configuration.
  • If you fail health checks due to downstream dependency issues (like db health), your service will restart, which could make things worse - extending the downtime.

At minimum, the health check endpoint used for Kubernetes probes should not fail if the db is down, and should not report the health of any other dependency (like Temporal).

justenwalker avatar May 04 '23 17:05 justenwalker

I've added tests for the indicators.

I've also tried to implement the desired readiness check behavior. I'm not too familiar with the intricacies of Micronaut; so if there is a different event or method to accomplish this, let me know.

justenwalker avatar May 05 '23 06:05 justenwalker

Thanks for the thoughts @justenwalker . Combining readiness/liveness into one endpoint was intentional - this was the simplest/easiest way to prevent previous outages that happened due to pathological error cases.

I agree separating the readiness/liveness checks are useful. However, what each checks tests for is something the Airbyte team needs to take a stab at. Since Airbyte Cloud inherits the OSS server, changing these checks affects deployment and uptime monitoring and needs to go through more testing before being merged in.

My suggestion is to keep the readiness/liveness code separation and have these still backed by the current health check handler.

W.r.t the webapp readiness checks - let's leave that out since the webapp set up is something we will change soon.

davinchia avatar May 08 '23 15:05 davinchia

@davinchia

My suggestion is to keep the readiness/liveness code separation and have these still backed by the current health check handler.

I believe this renders them useless - since they would be dependent on the database and thus violate Kubernetes expectations.

How about this instead:

  • We keep the new readiness handlers / indicators
  • Provide new helm values to override the health-check paths.
  • Use the existing endpoint /api/v1/health as the default path for those checks

This allows the Cloud Service to continue to use the charts without any modifications -- it's backwards compatible; but, allows others (like myself) who have issues with the current health check system to instead opt into the new endpoints.

justenwalker avatar May 08 '23 16:05 justenwalker

it's backwards compatible

I like this proposal.

I would actually argue database checks should be part of the readiness probe, since the server as currently written has this built-in assumption. We made sure the probe configuration was adjusted to account for this.

I agree the liveness probes as written in this PR are ok.

I believe this renders them useless - since they would be dependent on the database and thus violate Kubernetes expectations. but, allows others (like myself) who have issues with the current health check system

It feels like the database being part of the Kubernetes checks is a big point for you that I'm not sure I fully understand. Although I agree with you on the principle of Kubernetes probes, I don't understand the urgent risk in the current system.

Current thoughts:

  • It is ok for readiness probes to check downstream dependencies if the probe configuration accounts for this.
  • I agree splitting out the liveness checks avoids the case of a recreate loop. I'm not sure how plausible this is with the default probe configurations Airbyte ships with. This has not occurred in our experience operating Airbyte.
  • Gracefully shutting down connections is a hole in the server today. I believe this requires logic to catch the SIGTERM, so part of the work is orthogonal to this (although related).

Can you help me understand more about the practical issues you are seeing with this current check system? For example, are you seeing requests drop when rolling the server pods? This would help me combine my experience operating Airbyte with your concerns!

davinchia avatar May 08 '23 18:05 davinchia

@davinchia

I would actually argue database checks should be part of the readiness probe, since the server as currently written has this built-in assumption. We made sure the probe configuration was adjusted to account for this. It feels like the database being part of the Kubernetes checks is a big point for you that I'm not sure I fully understand. Although I agree with you on the principle of Kubernetes probes, I don't understand the urgent risk in the current system.

It's not necessarily just my opinion that readiness/liveness checks shouldn't include down-stream dependencies. I linked to a few articles in my initial PR:

Relevant Links

The readiness probe doesn't include dependencies to services such as:

databases, database migrations, APIs, third-party services

Don't depend on downstream dependencies, such as other services or databases, in your probe. If the dependency has a hiccup, or for example a database is restarted, removing your Pods from the Load Balancer's rotation will likely only make the downtime worse.

... consider what happens if there is a small, temporary increase in latency to one dependent service—maybe due to network congestion, a garbage-collection pause, or a temporary increase in load for the dependent service. If latency to the dependency increases to even slightly above one second, the readiness probe will fail and Kubernetes will no longer route traffic to the pod. Since all of the pods share the same dependency, it is very likely that all pods backing the service will fail the readiness probe at the same time. This will result in all pods being removed from the service routing. With no pods backing the service, Kubernetes will return HTTP 404, the default backend, for all requests to the service ... ... Avoid checking dependencies in liveness probes. Liveness probes should be inexpensive and have response times with minimal variance.

In general, if the pod is technically ready, even if it cannot function perfectly, it should not fail the readiness probe. A good compromise is to implement a “degraded mode,” for example, if there is no access to the database, answer read requests that can be addressed by local cache and return 503 (service unavailable) on write requests. Ensure that downstream services are resilient to a failure in the upstream service.

Having health checks depend on a backend dependency can be problematic, as an outage within your dependency can cause you to have an outage as Kubernetes restarts your containers.

Avoid using the Spring Boot default health check API for Kubernetes probes. In most cases you can just always return 200 in the liveness probe. Failing readiness probes are appropriate when containers are undergoing the initialization process or overloaded.

Responses

It is ok for readiness probes to check downstream dependencies if the probe configuration accounts for this.

When you say "account for", do you mean longer timeouts? More failures required? While this may help avoid taking the system down during momentary outages, it doesn't help with slightly longer ones, or with issues in the dependency itself.

I agree splitting out the liveness checks avoids the case of a recreate loop. I'm not sure how plausible this is with the default probe configurations Airbyte ships with. This has not occurred in our experience operating Airbyte.

If a liveness check fails, Kubernetes kills the pod. In the implementation above, the liveness check simply returns 200 OK, so it should never cause Kubernetes to restart the pod (unless the pod stops listening).

Gracefully shutting down connections is a hole in the server today. I believe this requires logic to catch the SIGTERM, so part of the work is orthogonal to this (although related).

I believe the event system that I've hooked into may be the way to go with this. When Micronaut gets a SIGTERM it sends a ShutdownEvent, which can be used to fail the readiness probe (and, importantly, NOT the liveness probe), allowing the server to remove itself from the Kubernetes Service while also completing any pending requests within the grace period.
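
The drain-on-SIGTERM idea can be sketched with a plain JVM shutdown hook standing in for Micronaut's shutdown event (the class `GracefulShutdown` and its fields are hypothetical, not from this PR):

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of graceful shutdown: on SIGTERM the JVM runs
// shutdown hooks; we fail readiness (so Kubernetes pulls the pod from
// the Service) and then wait for in-flight requests to drain before
// exiting. Liveness is untouched, so the pod isn't killed early.
class GracefulShutdown {
    final AtomicBoolean ready = new AtomicBoolean(true);
    final AtomicInteger inFlight = new AtomicInteger(0);

    void install() {
        Runtime.getRuntime().addShutdownHook(new Thread(this::drain));
    }

    void drain() {
        ready.set(false); // readiness now fails; new traffic stops arriving
        // Stay under the pod's terminationGracePeriodSeconds (30s default).
        long deadline = System.currentTimeMillis() + 25_000;
        while (inFlight.get() > 0 && System.currentTimeMillis() < deadline) {
            try {
                Thread.sleep(100); // poll until in-flight work completes
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
    }
}
```

Request handlers would increment `inFlight` on entry and decrement on exit; the drain loop then gives them the remaining grace period to finish.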

Can you help me understand more about the practical issues you are seeing with this current check system? For example, are you seeing requests drop when rolling the server pods? This would help me combine my experience operating Airbyte with your concerns!

Here is a scenario I encountered: I had a 3-server / 3-webapp setup in Kubernetes. The database became slow to respond. It wasn't down, but queries were taking longer. As a result, the readiness probe failed and removed a server from the LB. This put additional pressure on the remaining 2 servers, which were now sending more requests down their DB connections, causing larger latencies, which in turn caused them to fail their health checks. This took down another server, sending the remaining requests to the single last server. Eventually that server's health checks failed too, causing a full outage.

While this was happening, the liveness probes also began to fail - since they are tied to the same endpoint. So in addition to the servers being pulled from the LB, they were getting killed and starting back up - re-establishing connections to the database and causing further strain on it.

Because the webapp was using /api/v1/health for its own checks, it was also removing itself from the LB, causing the web UI to go down completely.

Summary

It mostly comes down to this:

When you have multiple instances of a service whose readiness/liveness checks rely on a shared downstream system (database, Temporal, etc.), you take down ALL instances of the component when any of those shared dependencies fails.

In this case, the server cannot gracefully degrade. It cannot respond with something more intelligent, like a 503 Service Unavailable response. Furthermore, if these checks are chained together like webapp -> server, then the UI cannot respond with "Something went wrong" or "We're having some difficulties, try again later!" - it will also be taken offline.

When evaluating what a readiness probe/liveness probe should do and what checks it performs, I encourage some scrutiny around what you think is the most appropriate response for the component to have:

  • Should all servers / webapps remove themselves from the LB when a shared dependency is unavailable?
  • Should all servers / webapps restart when a shared dependency is unavailable?
  • Will these tactics help restore service faster? (Will this apply back-pressure, or will it cause a thundering herd?)
  • Does this result in a good user experience? (Does it allow for graceful degradation, or will a user just see a blank screen or get a timeout response?)

One more thing

There is also a notion of a Startup Probe, which is useful if you have a boot sequence that can take a while (establishing db connections, warming caches, etc.). This is probed first until it returns a success -- and is never called again. If you are using readiness probes as startup probes, be aware that they sometimes have conflicting requirements, and it might be a good idea to split them out.

I believe the readiness indicator as I've implemented it accounts for this, since it sets the ready flag after it receives a ServiceStartupEvent and only sets it back to false on a ShutDownEvent -- but perhaps there are other considerations this doesn't take into account.
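
One way to picture this "sticky" readiness: the database is consulted only until its first successful check, after which the result latches and later probes never touch the database again. A framework-free sketch (the `StickyReadiness` class and its supplier are hypothetical):

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

// Sticky readiness sketch: the db-init check is only consulted until it
// first succeeds; after that the result latches, so transient db hiccups
// can never flip readiness back. Only shutdown clears the flag.
class StickyReadiness {
    private final AtomicBoolean initialized = new AtomicBoolean(false);
    private final AtomicBoolean shuttingDown = new AtomicBoolean(false);
    private final BooleanSupplier dbInitComplete; // e.g. "migrations finished"

    StickyReadiness(BooleanSupplier dbInitComplete) {
        this.dbInitComplete = dbInitComplete;
    }

    boolean isReady() {
        if (shuttingDown.get()) {
            return false;
        }
        if (!initialized.get() && dbInitComplete.getAsBoolean()) {
            initialized.set(true); // latch: never consult the db again
        }
        return initialized.get();
    }

    void onShutdown() {
        shuttingDown.set(true);
    }
}
```

Wiring the latch to a startup or migration-complete event instead of polling would give the same property: the database is checked once during startup, then readiness stays up until shutdown.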

justenwalker avatar May 08 '23 19:05 justenwalker

@davinchia I'll concede these points with regards to dependency checks in readiness probes:

  1. If you are using readiness checks in place of startup checks to make sure DB initialization was performed and any DB migrations happened. However, readiness checks continue to run periodically after the service is ready, so these checks should be "sticky" - you shouldn't keep pinging the database on every probe afterwards. With the readiness indicator we can use the ServiceStartEvent, or we can fire a specific event after the db initialization phase is complete.

See: https://srcco.de/posts/kubernetes-liveness-probes-are-dangerous.html

make sure that your Readiness Probe includes database initialization/migration the simplest way to achieve this is to start the HTTP server listening only after the initialization finished (e.g. Flyway DB migration etc), i.e. instead of changing the health check status, just don't start the web server until the DB migration is complete [2] ... do not depend on external dependencies (like data stores) for your Readiness/Liveness checks as this might lead to cascading failures

  2. If there is a reliable signal that the current instance of the application is having an issue specific to its use of a dependency and needs to shed load because it is overwhelmed, that might be a good reason to fail the readiness check.

For example: if each instance has a pool of 100 DB connections and your specific instance is at capacity (100 used connections), failing the readiness check and removing that instance from the LB to stop new requests from consuming more DB connections could be a good tactic. This allows requests in flight to complete and transparently moves new ones to instances that (may) have capacity - which could help prevent hot spots.
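
This load-shedding variant keys readiness off a signal specific to THIS instance (its own pool utilization), never off the shared dependency's overall health. A minimal sketch (the `PoolAwareReadiness` class is hypothetical):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical load-shedding readiness check: readiness fails only while
// this instance's own connection pool is saturated. In-flight requests
// still complete; new ones are routed to instances with spare capacity.
class PoolAwareReadiness {
    private final int maxConnections;
    private final AtomicInteger activeConnections = new AtomicInteger(0);

    PoolAwareReadiness(int maxConnections) {
        this.maxConnections = maxConnections;
    }

    void acquire() { activeConnections.incrementAndGet(); }
    void release() { activeConnections.decrementAndGet(); }

    boolean isReady() {
        return activeConnections.get() < maxConnections;
    }
}
```

Because every instance evaluates only its own pool, a saturated instance drops out of rotation alone, instead of all instances failing together the way a shared-dependency check would cause.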

justenwalker avatar May 09 '23 15:05 justenwalker

Some Updates to review:

  • @davinchia As requested, I removed the extra path I added for the airbyte-webapp component. I can probably make do with /.
  • @jdpgrailsdev I was able to consolidate the listeners into HealthCheckHandler by using @EventListener instead of implementing an interface. This allows all that logic to remain in one place.
  • @davinchia As discussed, I added an optional helm configuration to change the HTTP path used by the liveness and readiness checks. It defaults to /api/v1/health as it does today, and is therefore backwards compatible; it effectively makes the new readiness/liveness indicators "opt-in", since you'll have to change the path in values.yaml to get the behavior implemented in this PR.
  • I updated the helm chart README.md by running helm-docs; looks like this hasn't been run in a while -- might want to add a gradle task for that?

There is one idea I thought of but ultimately didn't implement: I could create and fire a custom event such as DatabaseReadyEvent once the flyway migration checks run in DatabaseEventListener, and use that to set the Ready status on the service instead of using ServerStartupEvent.

justenwalker avatar May 21 '23 05:05 justenwalker

@davinchia / @jdpgrailsdev - rebased on main

justenwalker avatar Jun 09 '23 00:06 justenwalker