kong icon indicating copy to clipboard operation
kong copied to clipboard

Timeout in liveness / readiness probes leading to outage

Open rodolfobrunner opened this issue 5 months ago • 2 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Kong version ($ kong version)

Kong 3.6.1

Current Behavior

When Kong loses connectivity to the Database, the first couple of requests to /status and /metrics take too long to return. Meanwhile, Kubernetes uses the /status to probe for the readiness / liveness status and since it got a timeout, it assumed that the pod was unhealthy.

Until Kubernets decided to restart the container, Kong was proxing requests without a problem.

After looking at kong.api.routes.health.lua we realized that kong.db:connect()instructing to test connectivity is being executed both synchronously and without passing a timeout leading to latency of 5+ secs of the first requests.

The sequence of events were:

Several "harmless" entries of: [error] 1322#0: *664431 [lua] health.lua:61: fn(): failed to connect to PostgreSQL during /status endpoint check: [PostgreSQL error] timeout, client: 10.145.65.245, server: kong_status, request: "GET /status HTTP/1.1", host: "10.145.43.33:8802" [error] 1322#0: *660572 [kong] exporter.lua:353 prometheus: failed to reach database while processing/metrics endpoint: timeout, client: 10.145.46.50, server: kong_status, request: "GET /metrics HTTP/1.1", host: "10.145.43.33:8802"

Followed by a surprise SIGQUIT:

2024/09/16 07:03:42 [notice] 1#0: exit
2024/09/16 07:03:42 [notice] 1#0: worker process 1322 exited with code 0
2024/09/16 07:03:42 [notice] 1#0: signal 17 (SIGCHLD) received from 1322
2024/09/16 07:03:41 [notice] 1322#0: exit
2024/09/16 07:03:41 [notice] 1322#0: exiting
2024/09/16 07:03:38 [notice] 1#0: signal 29 (SIGIO) received
2024/09/16 07:03:38 [notice] 1#0: worker process 1323 exited with code 0
2024/09/16 07:03:38 [notice] 1#0: signal 17 (SIGCHLD) received from 1323
2024/09/16 07:03:37 [notice] 1323#0: exit
2024/09/16 07:03:37 [notice] 1323#0: exiting
2024/09/16 07:03:37 [notice] 1323#0: gracefully shutting down
2024/09/16 07:03:37 [notice] 1322#0: gracefully shutting down
2024/09/16 07:03:37 [notice] 1#0: signal 3 (SIGQUIT) received from 1990, shutting down

Followed by the apparent shutdown process, we see: Screenshot 2024-09-16 at 14 09 53

Please note that the pod was not able to die even though it received a SIGQUIT. After the pod regained the DB connectivity, it remained the same. The reason for this is likely a crashloopbackoff while shutting down, as it tried to save some state in a DB it had no connectivity.

Expected Behavior

Kong had no problem to route request without DB connectivity until it is taken out from the cluster due to /status timeout.

As the DB is not in the hot path, we did not expect an outage on a DB restart.

In our opinion, kong.db:connect() to test connectivity should be done either asynchronously or have a timeout

Steps To Reproduce

  1. Setup a Kong in traditional mode
  2. Disable the DB
  3. Make a request to any route (should work)
  4. Make a request to /status (should have a latency spike for the first attempts)

Anything else?

No response

rodolfobrunner avatar Sep 16 '24 13:09 rodolfobrunner