Kubernetes probes fail for an extra ~60s when Envoy health checks are enabled
Bug description:
When Envoy active health checks are enabled, configured Kubernetes probes (of any type: startup, liveness, or readiness) fail for an extra 60s when the Pod starts up compared to when health checks are not enabled. After that, they succeed normally.
I've already dug into this a bit and here is what I've found so far:
I've pushed the changes I used to help debug this to my fork. To summarize, I added a /ok endpoint on the bookwarehouse that responds with HTTP 200, configured bookstore-v1's health check for the bookwarehouse to use that endpoint, and removed the HTTP health check filter from the bookwarehouse, all to try to get down to the simplest possible health check configuration.
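For illustration, the relevant piece of that configuration looks roughly like the following. This is a sketch rather than a copy of the actual config (the real configs are in the gist linked below); the cluster name is taken from the Envoy logs, and the timeout and thresholds are assumed.

```yaml
# Sketch of the simplified active health check on bookstore-v1's cluster for
# the bookwarehouse: the /ok endpoint and a once-per-second interval described
# above. Timeout and thresholds are illustrative, not copied from the gist.
clusters:
- name: bookwarehouse/bookwarehouse/local
  health_checks:
  - timeout: 1s
    interval: 1s             # configured to probe once per second
    healthy_threshold: 1
    unhealthy_threshold: 1
    http_health_check:
      path: /ok              # simplified endpoint added on the bookwarehouse
```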
Running the demo with these changes (and Prometheus enabled) yields results I wouldn't expect. The first thing I noticed is that Prometheus doesn't even seem able to scrape metrics from bookstore-v1 until it's been up for one minute.
envoy_server_uptime{source_namespace="bookstore"}
Looking at the health check statistics, even though I configured health checks to run once per second, I see a one-minute gap where no health checks occur, even after one was recorded as successful. The following graphs show the one-minute gap during which 2 attempts were recorded, 1 of which was successful; after that, health checks run and succeed once per second:
envoy_cluster_health_check_attempt
envoy_cluster_health_check_success
The Envoy container logs for the bookstore-v1 Pod match the results from these graphs. The first three health checks I see in the logs are each one minute apart:
[2021-07-20 21:00:46.827][1][debug][hc] [source/common/upstream/health_checker_impl.cc:295] [C1] connection/stream error health_flags=/failed_active_hc
[2021-07-20 21:01:46.752][1][debug][hc] [source/common/upstream/health_checker_impl.cc:342] [C3] hc response=200 health_flags=/failed_active_hc
[2021-07-20 21:02:46.683][1][debug][hc] [source/common/upstream/health_checker_impl.cc:342] [C3] hc response=200 health_flags=healthy
(full logs and envoy configs for each bookstore: https://gist.github.com/nojnhuh/dbc3072c0ca9f2c3a554b424b50cbfa3)
Each health check log after those is the same as the third, with health_flags=healthy, and each occurs 1 second after the previous one. It seems odd to me that the second health check has response=200 but health_flags=/failed_active_hc.
When comparing Envoy logs with bookstore-v2, which is not configured to health check the bookwarehouse, it's pretty clear where the differences start. The following messages appear in the bookstore-v2 logs:
[2021-07-20 21:00:49.790][1][debug][init] [source/common/init/watcher_impl.cc:14] init manager Cluster bookwarehouse/bookwarehouse/local initialized, notifying ClusterImplBase
[2021-07-20 21:00:49.790][1][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:968] adding TLS cluster bookwarehouse/bookwarehouse/local
bookstore-v1's Envoy has the same first message, but then they differ. Eliding some messages:
[2021-07-20 21:00:47.148][1][debug][init] [source/common/init/watcher_impl.cc:14] init manager Cluster bookwarehouse/bookwarehouse/local initialized, notifying ClusterImplBase
[2021-07-20 21:00:50.957][1][debug][main] [source/server/server.cc:219] Envoy is not fully initialized, skipping histogram merge and flushing stats
A group of repeated messages throughout the logs seems to indicate the first one-minute gap of no health checks. This group includes messages like:
[2021-07-20 21:00:55.958][1][debug][main] [source/server/server.cc:219] Envoy is not fully initialized, skipping histogram merge and flushing stats
[2021-07-20 21:00:55.975][1][debug][upstream] [source/common/upstream/logical_dns_cluster.cc:159] DNS refresh rate reset for osm-controller.osm-system.svc.cluster.local, refresh rate 5000 ms
[2021-07-20 21:00:56.899][1][debug][upstream] [source/common/upstream/upstream_impl.cc:279] transport socket match, socket default selected for host with address 0.0.0.0:14001
[2021-07-20 21:00:56.899][1][debug][upstream] [source/common/upstream/strict_dns_cluster.cc:170] DNS refresh rate reset for 0.0.0.0, refresh rate 5000 ms
Following 12 of these groups (which reference a 5000 ms DNS refresh rate), the second health check occurs and the "adding TLS cluster" message seen in the bookstore-v2 logs appears:
[2021-07-20 21:01:46.752][1][debug][hc] [source/common/upstream/health_checker_impl.cc:342] [C3] hc response=200 health_flags=/failed_active_hc
[2021-07-20 21:01:46.752][1][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:968] adding TLS cluster bookwarehouse/bookwarehouse/local
Then, Envoy begins receiving Kubernetes probes and responding successfully:
[2021-07-20 21:01:49.282][23][debug][http] [source/common/http/conn_manager_impl.cc:882] [C4][S15199497057362930131] request headers complete (end_stream=true):
':authority', '10.244.0.10:15903'
':path', '/osm-startup-probe'
':method', 'GET'
'user-agent', 'kube-probe/1.21'
'accept', '*/*'
'connection', 'close'
Other requests, such as metrics scrapes from Prometheus, requests from the bookbuyer, and requests to the bookwarehouse, also start being logged and succeeding at the same time. However, no health checks occur for another minute. Comparing the logs with bookstore-v2, it's not clear why health checks aren't running every second as configured.
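The /osm-startup-probe path and port 15903 in that request appear to be the values OSM's sidecar injector rewrites the Pod's probes to so that Envoy serves them. As a rough illustration (the path and port are taken from the log above; the timing values are hypothetical), the rewritten probe on bookstore-v1 would look something like:

```yaml
# Rough sketch of a startup probe after OSM's sidecar injection. The path and
# port match the kube-probe request logged above; periodSeconds and
# failureThreshold are hypothetical placeholders.
startupProbe:
  httpGet:
    path: /osm-startup-probe
    port: 15903
  periodSeconds: 5
  failureThreshold: 12
```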
These results are very similar even when bookstore-v1 defines no Kubernetes probes at all. The main difference is that the Pod is marked Ready just a few seconds after it starts up rather than a minute later. The Envoy logs and Prometheus metrics show the same patterns, though.
One hint I found in the Envoy source code is a comment saying that Envoy's workers will not start until Envoy has been fully initialized, which the bookstore-v1 logs indicate happens only after the initial failing health check. That would also explain why none of the other HTTP traffic comes through until the health check passes and Envoy is initialized.
Overall, I'm most curious as to why the health checks don't run once per second like they're configured to for the first two minutes. If they did, then I would expect to see Kubernetes probes passing at about the same time as when no health checks are defined, at least within a health check period.
Affected area (please mark with X where applicable):
- Install [ ]
- SMI Traffic Access Policy [ ]
- SMI Traffic Specs Policy [ ]
- SMI Traffic Split Policy [ ]
- Permissive Traffic Policy [ ]
- Ingress [ ]
- Egress [ ]
- Envoy Control Plane [X]
- CLI Tool [ ]
- Metrics [ ]
- Certificate Management [ ]
- Sidecar Injection [ ]
- Logging [ ]
- Debugging [ ]
- Tests [ ]
- Demo [ ]
- CI System [ ]
Expected behavior: Kubernetes probes succeed at about the same time as when health checks are not configured.
Steps to reproduce the bug (as precisely as possible): Run the automated demo with health checks enabled (OSM installed with --set OpenServiceMesh.featureFlags.enableEnvoyActiveHealthChecks=true).
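For reference, the equivalent chart values would look like this (a sketch based only on the flag above):

```yaml
# Helm values equivalent to the --set flag above (sketch).
OpenServiceMesh:
  featureFlags:
    enableEnvoyActiveHealthChecks: true
```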
How was OSM installed?: Automated demo
Anything else we need to know?:
Environment:
- OSM version (use osm version): Version: dev; Commit: 876bb20a787270e4a1d23d5ea857938300ade579; Date:
- Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.3", GitCommit:"ca643a4d1f7bfe34773c74f79527be4afd95bf39", GitTreeState:"clean", BuildDate:"2021-07-15T20:58:09Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-21T23:01:33Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}
- Size of cluster (number of worker nodes in the cluster): 1
- Others:
I encountered the same issue before. Then I found this:
https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/core/v3/health_check.proto.html#envoy-v3-api-field-config-core-v3-healthcheck-no-traffic-healthy-interval
When there is no traffic, Envoy performs active health checks according to no_traffic_interval, which defaults to 60s. I just ran a local Envoy test showing that, once there is traffic between Envoy and the upstream, health checks follow the regular interval.
I think the books demo might be following this scenario if permissive mode is disabled initially.
This shouldn't be a big problem for services that have traffic from the beginning. One problem I can see is that, since active HC has an initial interval of 60s, if an endpoint becomes unhealthy within that 60s, Envoy will not mark it as unhealthy immediately; it will wait until the next HC.
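For illustration, these are the HealthCheck fields involved (a sketch; the field names are from the Envoy docs linked above, the values are illustrative, and OSM does not necessarily expose a knob for this today):

```yaml
# Sketch only: no_traffic_interval controls how often Envoy health checks a
# cluster that has seen no traffic (default 60s); interval applies once the
# cluster has traffic. Values here are illustrative.
health_checks:
- timeout: 1s
  interval: 1s               # regular interval once traffic has been seen
  no_traffic_interval: 5s    # defaults to 60s, producing the gaps observed above
  healthy_threshold: 1
  unhealthy_threshold: 1
  http_health_check:
    path: /ok
```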
Added default label size/needed. Please consider re-labeling this issue appropriately.
Are we still encountering this bug?
This issue will be closed due to a long period of inactivity. If you would like this issue to remain open then please comment or update.
Issue closed due to inactivity.