Timeout in liveness / readiness probes leading to outage
Is there an existing issue for this?
- [X] I have searched the existing issues
Kong version (`$ kong version`)
Kong 3.6.1
Current Behavior
When Kong loses connectivity to the database, the first couple of requests to `/status` and `/metrics` take too long to return. Meanwhile, Kubernetes uses `/status` to probe for readiness / liveness, and since those requests timed out, it assumed the pod was unhealthy. Until Kubernetes decided to restart the container, Kong was proxying requests without a problem.
After looking at `kong.api.routes.health.lua`, we realized that the `kong.db:connect()` call used to test connectivity is executed synchronously and without a timeout, leading to 5+ seconds of latency on the first requests.
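For illustration, a timeout-bounded variant of such a check could look roughly like the sketch below. This is not Kong's actual code: it talks to pgmoon directly with hypothetical connection parameters, whereas Kong wraps pgmoon behind `kong.db`; the point is only that bounding the socket timeout makes a dead database fail fast instead of stalling the health endpoint.

```lua
-- Hypothetical sketch: a DB connectivity probe bounded by an explicit
-- timeout, using pgmoon directly (connection details are placeholders).
local pgmoon = require("pgmoon")

local function check_db(timeout_ms)
  local pg = pgmoon.new({
    host     = "127.0.0.1",  -- assumption: replace with your pg_host
    port     = 5432,
    database = "kong",
    user     = "kong",
  })
  -- Bound connect/read/write so an unreachable DB returns an error
  -- quickly instead of blocking the /status request for 5+ seconds.
  pg:settimeout(timeout_ms)
  local ok, err = pg:connect()
  if not ok then
    return nil, err
  end
  pg:keepalive()
  return true
end
```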
The sequence of events was:
Several "harmless" entries of:
[error] 1322#0: *664431 [lua] health.lua:61: fn(): failed to connect to PostgreSQL during /status endpoint check: [PostgreSQL error] timeout, client: 10.145.65.245, server: kong_status, request: "GET /status HTTP/1.1", host: "10.145.43.33:8802"
[error] 1322#0: *660572 [kong] exporter.lua:353 prometheus: failed to reach database while processing/metrics endpoint: timeout, client: 10.145.46.50, server: kong_status, request: "GET /metrics HTTP/1.1", host: "10.145.43.33:8802"
Followed by a surprise SIGQUIT:
2024/09/16 07:03:42 [notice] 1#0: exit
2024/09/16 07:03:42 [notice] 1#0: worker process 1322 exited with code 0
2024/09/16 07:03:42 [notice] 1#0: signal 17 (SIGCHLD) received from 1322
2024/09/16 07:03:41 [notice] 1322#0: exit
2024/09/16 07:03:41 [notice] 1322#0: exiting
2024/09/16 07:03:38 [notice] 1#0: signal 29 (SIGIO) received
2024/09/16 07:03:38 [notice] 1#0: worker process 1323 exited with code 0
2024/09/16 07:03:38 [notice] 1#0: signal 17 (SIGCHLD) received from 1323
2024/09/16 07:03:37 [notice] 1323#0: exit
2024/09/16 07:03:37 [notice] 1323#0: exiting
2024/09/16 07:03:37 [notice] 1323#0: gracefully shutting down
2024/09/16 07:03:37 [notice] 1322#0: gracefully shutting down
2024/09/16 07:03:37 [notice] 1#0: signal 3 (SIGQUIT) received from 1990, shutting down
Following the apparent shutdown process, please note that the pod was not able to die even though it received a SIGQUIT; even after it regained DB connectivity, it remained stuck. The likely reason is that, while shutting down, Kong tried to save some state to a database it could not reach, leaving the pod in a CrashLoopBackOff.
Expected Behavior
Kong had no problem routing requests without DB connectivity until it was taken out of the cluster due to the `/status` timeout.
As the DB is not in the hot path, we did not expect an outage on a DB restart.
In our opinion, the `kong.db:connect()` call used to test connectivity should either be executed asynchronously or have a timeout.
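One way the asynchronous variant could look (a sketch only, not Kong's implementation): a recurring background timer refreshes a cached DB-health flag, and the `/status` handler reads that flag instead of dialing the database on the request path.

```lua
-- Sketch: decouple /status from the live DB check by caching the result
-- of a periodic background probe. `kong` and `ngx` refer to the globals
-- available inside a Kong/OpenResty worker; details are assumptions.
local db_reachable = true  -- optimistic default until the first probe runs

local function probe(premature)
  if premature then
    return  -- worker is shutting down; do nothing
  end
  local ok = kong.db:connect()
  db_reachable = ok and true or false
  if ok then
    kong.db:close()
  end
end

-- Refresh the flag every 5 seconds, off the request path.
ngx.timer.every(5, probe)

-- /status handler: never touches the DB directly, so it answers
-- immediately even while the database is down or flapping.
local function status_handler()
  return kong.response.exit(200, {
    database = { reachable = db_reachable },
  })
end
```

With this shape, a DB outage degrades the reported health information but never adds latency to the probe itself, so Kubernetes would not kill a pod that is still proxying traffic correctly.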
Steps To Reproduce
- Set up Kong in traditional mode
- Disable the DB
- Make a request to any route (should work)
- Make a request to `/status` (should show a latency spike for the first attempts)
Anything else?
No response