TestMutationSecondMasterSetDown is flaky
TestMutationSecondMasterSetDown failed 4 times today:
- cloud-on-k8s-e2e-tests-gke-k8s-versions on GKE 1.20 and 1.22
- cloud-on-k8s-e2e-tests-kind-k8s-versions on Kind 1.23.6 IPv6
- cloud-on-k8s-e2e-tests-main
=== RUN TestMutationSecondMasterSetDown/Elasticsearch_cluster_health_should_not_have_been_red_during_mutation_process
steps_mutation.go:150: Elasticsearch cluster health check failure at 2022-07-06 18:12:25.389331963 +0000 UTC m=+5386.055151684: elasticsearch client failed for https://test-mutation-2nd-master-set-mqcr-es-internal-http.e2e-f79bs-mercury.svc:9200/_cluster/health: Get "https://test-mutation-2nd-master-set-mqcr-es-internal-http.e2e-f79bs-mercury.svc:9200/_cluster/health": context deadline exceeded
steps_mutation.go:152:
Error Trace: steps_mutation.go:152
Error: Not equal:
expected: 0
actual : 1
Test: TestMutationSecondMasterSetDown/Elasticsearch_cluster_health_should_not_have_been_red_during_mutation_process
It also failed yesterday (https://devops-ci.elastic.co/job/cloud-on-k8s-e2e-tests-main/384/testReport/). The most notable recent change is #5844.
I took a look today and tried to figure out why this is failing. Just dropping my raw notes here for this failed run: https://devops-ci.elastic.co/job/cloud-on-k8s-e2e-tests-gke-k8s-versions/771//testReport
- Jul 8, 2022 @ 04:57:07.135 master-1 shutdown requested
- Jul 8, 2022 @ 04:57:08.866 Could not verify license, re-queuing: elasticsearch client failed for https://test-mutation-2nd-master-set-dsw2-es-internal-http.e2e-0gspi-mercury.svc:9200/_license: Get "https://test-mutation-2nd-master-set-dsw2-es-internal-http.e2e-0gspi-mercury.svc:9200/_license": dial tcp 10.48.24.6:9200: connect: connection refused
- Jul 8, 2022 @ 04:57:19.887 operator deletes master-1 PVC i.e. has observed deletion of pod
- Jul 8, 2022 @ 04:57:19.795 ES Log: nodes: {test-mutation-2nd-master-set-dsw2-es-masterdata-1}{_aB8czzjRum7U6GOficTzw}{CF3GtulDS6OSZaPqKdwoaA}{test-mutation-2nd-master-set-dsw2-es-masterdata-1}{10.48.34.94}{10.48.34.94:9300}{cdfhilmrstw}{k8s_node_name=gke-eck-gke20-771-e2e-default-pool-a3190d42-kmns, ml.machine_memory=2147483648, xpack.installed=true, ml.max_jvm_size=1073741824} {test-mutation-2nd-master-set-dsw2-es-master-0}{CnnhcSE2TjSpBCUDKKTEfg}{AkmaTK-NQXm8wjMhNEUyGA}{test-mutation-2nd-master-set-dsw2-es-master-0}{10.48.32.114}{10.48.32.114:9300}{m}{xpack.installed=true, k8s_node_name=gke-eck-gke20-771-e2e-default-pool-3efcb5bd-cvv0}, local, master {test-mutation-2nd-master-set-dsw2-es-masterdata-0}{ugdXiHyqSbSVKyAzldk3kw}{JvsuytbxQoyPwNuCNZR4NQ}{test-mutation-2nd-master-set-dsw2-es-masterdata-0}{10.48.33.145}{10.48.33.145:9300}{cdfhilmrstw}{k8s_node_name=gke-eck-gke20-771-e2e-default-pool-f72a6432-v926, ml.machine_memory=2147483648, xpack.installed=true, ml.max_jvm_size=1073741824}
- Jul 8, 2022 @ 04:57:20.026 master-0 shutdown requested
- Jul 8, 2022 @ 04:57:20.157 shutdown complete, operator scales down the StatefulSet
- Jul 8, 2022 @ 04:57:20.478 master node changed {previous [{test-mutation-2nd-master-set-dsw2-es-master-0}{CnnhcSE2TjSpBCUDKKTEfg}{AkmaTK-NQXm8wjMhNEUyGA}{test-mutation-2nd-master-set-dsw2-es-master-0}{10.48.32.114}{10.48.32.114:9300}{m}], current []}, term: 3, version: 68, reason: becoming
- Jul 8, 2022 @ 04:57:20.665 failing [elected-as-master ([3] nodes joined)[FINISH_ELECTION, {test-mutation-2nd-master-set-dsw2-es-masterdata-1}{_aB8czzjRum7U6GOficTzw}{CF3GtulDS6OSZaPqKdwoaA}{test-mutation-2nd-master-set-dsw2-es-masterdata-1}{10.48.34.94}{10.48.34.94:9300}{cdfhilmrstw} completing election, {test-mutation-2nd-master-set-dsw2-es-masterdata-0}{ugdXiHyqSbSVKyAzldk3kw}{JvsuytbxQoyPwNuCNZR4NQ}{test-mutation-2nd-master-set-dsw2-es-masterdata-0}{10.48.33.145}{10.48.33.145:9300}{cdfhilmrstw} completing election, {test-mutation-2nd-master-set-dsw2-es-master-0}{CnnhcSE2TjSpBCUDKKTEfg}{AkmaTK-NQXm8wjMhNEUyGA}{test-mutation-2nd-master-set-dsw2-es-master-0}{10.48.32.114}{10.48.32.114:9300}{m} completing election]]: failed to commit cluster state version [69]
- Jul 8, 2022 @ 04:57:20.996 failing [elected-as-master ([2] nodes joined)[FINISH_ELECTION, {test-mutation-2nd-master-set-dsw2-es-masterdata-1}{_aB8czzjRum7U6GOficTzw}{CF3GtulDS6OSZaPqKdwoaA}{test-mutation-2nd-master-set-dsw2-es-masterdata-1}{10.48.34.94}{10.48.34.94:9300}{cdfhilmrstw} completing election, {test-mutation-2nd-master-set-dsw2-es-masterdata-0}{ugdXiHyqSbSVKyAzldk3kw}{JvsuytbxQoyPwNuCNZR4NQ}{test-mutation-2nd-master-set-dsw2-es-masterdata-0}{10.48.33.145}{10.48.33.145:9300}{cdfhilmrstw} completing election]]: failed to commit cluster state version [70]
- Jul 8, 2022 @ 04:57:21.199 master node changed {previous [], current [{test-mutation-2nd-master-set-dsw2-es-masterdata-0}{ugdXiHyqSbSVKyAzldk3kw}{JvsuytbxQoyPwNuCNZR4NQ}{test-mutation-2nd-master-set-dsw2-es-masterdata-0}{10.48.33.145}{10.48.33.145:9300}{cdfhilmrstw}]}, term: 8, version: 70, reason: ApplyCommitRequest{term=8, version=70, sourceNode={test-mutation-2nd-master-set-dsw2-es-masterdata-0}{ugdXiHyqSbSVKyAzldk3kw}{JvsuytbxQoyPwNuCNZR4NQ}{test-mutation-2nd-master-set-dsw2-es-masterdata-0}{10.48.33.145}{10.48.33.145:9300}{cdfhilmrstw}{k8s_node_name=gke-eck-gke20-771-e2e-default-pool-f72a6432-v926, ml.machine_memory=2147483648, xpack.installed=true, ml.max_jvm_size=1073741824}}
- Jul 8, 2022 @ 04:57:42.066 test-mutation-2nd-master-set-dsw2-es-master-0 handshake to [10.48.33.145:9300] failed
- Jul 8, 2022 @ 04:57:43.715 test-mutation-2nd-master-set-dsw2-es-master-0 readiness probe failed
- Jul 8, 2022 @ 04:57:47.064 test-mutation-2nd-master-set-dsw2-es-master-0 handshake to [10.48.33.145:9300] failed (this is the same time as the failed e2e health check; did we maybe hit master-0?)
- Jul 8, 2022 @ 04:57:43.030 test-mutation-2nd-master-set-dsw2-es-master-0 : master not discovered or elected yet, an election requires a node with id [_aB8czzjRum7U6GOficTzw], have only discovered non-quorum [{test-mutation-2nd-master-set-dsw2-es-master-0}{CnnhcSE2TjSpBCUDKKTEfg}{AkmaTK-NQXm8wjMhNEUyGA}{test-mutation-2nd-master-set-dsw2-es-master-0}{10.48.32.114}{10.48.32.114:9300}{m}]; discovery will continue using [127.0.0.1:9300, 127.0.0.1:9301, 127.0.0.1:9302, 127.0.0.1:9303, 127.0.0.1:9304, 127.0.0.1:9305, 10.48.33.145:9300, 10.48.33.146:9300, 10.48.34.94:9300] from hosts providers and [{test-mutation-2nd-master-set-dsw2-es-master-0}{CnnhcSE2TjSpBCUDKKTEfg}{AkmaTK-NQXm8wjMhNEUyGA}{test-mutation-2nd-master-set-dsw2-es-master-0}{10.48.32.114}{10.48.32.114:9300}{m}, {test-mutation-2nd-master-set-dsw2-es-masterdata-1}{_aB8czzjRum7U6GOficTzw}{CF3GtulDS6OSZaPqKdwoaA}{test-mutation-2nd-master-set-dsw2-es-masterdata-1}{10.48.34.94}{10.48.34.94:9300}{cdfhilmrstw}, {test-mutation-2nd-master-set-dsw2-es-masterdata-0}{ugdXiHyqSbSVKyAzldk3kw}{JvsuytbxQoyPwNuCNZR4NQ}{test-mutation-2nd-master-set-dsw2-es-masterdata-0}{10.48.33.145}{10.48.33.145:9300}{cdfhilmrstw}] from last-known cluster state; node term 8, last-accepted version 70 in term 8; joining [{test-mutation-2nd-master-set-dsw2-es-masterdata-0}{ugdXiHyqSbSVKyAzldk3kw}{JvsuytbxQoyPwNuCNZR4NQ}{test-mutation-2nd-master-set-dsw2-es-masterdata-0}{10.48.33.145}{10.48.33.145:9300}{cdfhilmrstw}] in term [8] has status [waiting for local cluster applier] after [22s/22010ms]
- Jul 8, 2022 @ 04:57:50.567 nodes: {test-mutation-2nd-master-set-dsw2-es-masterdata-0}{ugdXiHyqSbSVKyAzldk3kw}{JvsuytbxQoyPwNuCNZR4NQ}{test-mutation-2nd-master-set-dsw2-es-masterdata-0}{10.48.33.145}{10.48.33.145:9300}{cdfhilmrstw}{k8s_node_name=gke-eck-gke20-771-e2e-default-pool-f72a6432-v926, xpack.installed=true, ml.max_jvm_size=1073741824, ml.machine_memory=2147483648}, local, master {test-mutation-2nd-master-set-dsw2-es-masterdata-1}{_aB8czzjRum7U6GOficTzw}{CF3GtulDS6OSZaPqKdwoaA}{test-mutation-2nd-master-set-dsw2-es-masterdata-1}{10.48.34.94}{10.48.34.94:9300}{cdfhilmrstw}{k8s_node_name=gke-eck-gke20-771-e2e-default-pool-a3190d42-kmns, ml.machine_memory=2147483648, xpack.installed=true, ml.max_jvm_size=1073741824}
- Jul 8, 2022 @ 04:57:54.734 Pruning keys from certificates secret: test-mutation-2nd-master-set-dsw2-es-master-0.tls.crt, test-mutation-2nd-master-set-dsw2-es-master-0.tls.key (the operator has observed the deletion at this point)
- "2022-07-08T04:58:02.101Z" : Elasticsearch cluster health check failure at 2022-07-08 04:57:47.352919486 +0000 UTC m=+5352.139371606: elasticsearch client failed for https://test-mutation-2nd-master-set-dsw2-es-internal-http.e2e-0gspi-mercury.svc:9200/_cluster/health: Get "https://test-mutation-2nd-master-set-dsw2-es-internal-http.e2e-0gspi-mercury.svc:9200/_cluster/health": context deadline exceeded
If we compare that with a successful run, what I see is that on a successful run the master node was already the only surviving node for this test case, i.e. masterdata-0. That means that during the scale-down of the other nodes no new master election was necessary.
My impression is that the connection error happens when the current master is being shut down in the failed run. I am not familiar enough with the details of how master elections work in Elasticsearch, but it seems that the first election attempts somehow fail (I am uncertain how significant that is), yet a second later we have a new master node.
What does not quite add up for me is that the failed health check is much later, ~20 seconds after the new master is elected. The master-0 Pod still exists at that point and I am wondering if we hit the decommissioned Pod, which no longer has a master and is no longer part of the cluster (see the handshake failures). I was always thinking that a Pod is removed from the service endpoints once the deletion timestamp is set, but obviously there might be delays until this propagates through the k8s cluster.
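One way to check this would be to log which backend IP each health-check request actually connects to. A rough sketch (not part of the e2e framework; the service URL and the relaxed TLS config are placeholders, the real tests would use the cluster's CA) using net/http/httptrace:

```go
// Hypothetical debugging sketch: log which Pod IP a health-check request
// actually lands on, to see whether the Service still routes to the
// terminating master-0 Pod.
package main

import (
	"context"
	"crypto/tls"
	"log"
	"net/http"
	"net/http/httptrace"
	"time"
)

func main() {
	// Placeholder service URL, matching the pattern seen in the test logs.
	url := "https://test-mutation-2nd-master-set-dsw2-es-internal-http.e2e-0gspi-mercury.svc:9200/_cluster/health"

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Record the resolved connection so we know which Pod IP served the request.
	trace := &httptrace.ClientTrace{
		GotConn: func(info httptrace.GotConnInfo) {
			log.Printf("health check connected to %s (reused=%v)", info.Conn.RemoteAddr(), info.Reused)
		},
	}

	req, err := http.NewRequestWithContext(httptrace.WithClientTrace(ctx, trace), http.MethodGet, url, nil)
	if err != nil {
		log.Fatal(err)
	}

	client := &http.Client{Transport: &http.Transport{
		// Skipping verification only to keep the sketch short.
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}}
	resp, err := client.Do(req)
	if err != nil {
		log.Printf("health check failed: %v", err)
		return
	}
	defer resp.Body.Close()
	log.Printf("health check status: %s", resp.Status)
}
```

If the logged remote address were master-0's Pod IP while that Pod is terminating, that would support the endpoint-propagation-delay theory.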
Finally, there is another failed connection attempt (04:57:08.866) that the operator sees much earlier but that the e2e test health check misses.
The request tracing added in https://github.com/elastic/cloud-on-k8s/pull/5869 gives some additional clues.
Based on a recent test failure:
Elasticsearch cluster health check failure at 2022-07-12 03:38:17.428219659 +0000 UTC m=+3641.730346691: elasticsearch client failed for https://test-mutation-2nd-master-set-jdjv-es-internal-http.e2e-5zf22-mercury.svc:9200/_cluster/health: Get "https://test-mutation-2nd-master-set-jdjv-es-internal-http.e2e-5zf22-mercury.svc:9200/_cluster/health": context deadline exceeded
Corresponding request in the Elasticsearch logs:
Observations:
- we are not hitting the node shutting down (master-1) but actually the node that will eventually be the only surviving one, so my theory was wrong (good!)
- 10.244.3.3 is the operator Pod, so this request is definitely coming from the e2e test Pod
Matching response:
So this is .226 seconds short of 5 seconds to reply, and 5 seconds is our timeout for the e2e test continuous health check. So my new theory is that by the time the other side sees this request we have already exceeded the 5-second timeout and thus fail the test.
But this of course happens on top of the 20 seconds of unavailability that we allow for anyway during a rolling upgrade.
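To make the timing concrete, here is a minimal sketch of a continuous health check with a 5-second per-request timeout (illustrative only, not the actual e2e test code; the URL, polling interval and plain http.DefaultClient are assumptions):

```go
// Minimal sketch (not the actual e2e code) of a continuous health check with a
// 5-second per-request deadline. The reply observed in the traces took ~4.77s,
// so a couple hundred milliseconds of DNS/TCP/TLS/transit overhead is enough to
// push the total past 5s and surface "context deadline exceeded".
package main

import (
	"context"
	"io"
	"log"
	"net/http"
	"time"
)

// checkOnce issues a single /_cluster/health request bounded by a 5s deadline.
func checkOnce(ctx context.Context, client *http.Client, baseURL string) error {
	reqCtx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()
	req, err := http.NewRequestWithContext(reqCtx, http.MethodGet, baseURL+"/_cluster/health", nil)
	if err != nil {
		return err
	}
	resp, err := client.Do(req)
	if err != nil {
		return err // this is where "context deadline exceeded" surfaces
	}
	defer resp.Body.Close()
	_, err = io.Copy(io.Discard, resp.Body)
	return err
}

func main() {
	// Placeholder URL and polling interval; the real test uses its own client and settings.
	baseURL := "https://test-mutation-2nd-master-set-dsw2-es-internal-http.e2e-0gspi-mercury.svc:9200"
	for range time.Tick(3 * time.Second) {
		if err := checkOnce(context.Background(), http.DefaultClient, baseURL); err != nil {
			log.Printf("continuous health check failure: %v", err)
		}
	}
}
```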
What is interesting when looking at some of the other requests leading up to this failure is that we seem to be getting some BAD_REQUEST responses back on _cluster/health requests:
[23][null][BAD_REQUEST][application/json;charset=utf-8][351] sent response to [Netty4HttpChannel{localAddress=/10.244.3.206:9200, remoteAddress=/10.244.2.3:46366}] success [true]
The 400 BAD_REQUEST log entries happen if the client cancels the request before the response has been computed. Cancelled tasks that produce HTTP responses in Elasticsearch then apparently default to 400. So this was a bit of a red herring.
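For completeness, the client-side cancellation pattern is easy to reproduce in isolation; the stub server below is hypothetical (no Elasticsearch involved) and just sleeps longer than the client's deadline:

```go
// Tiny self-contained illustration of the cancellation pattern described above:
// the client gives up after its 1s deadline, while the server only finishes its
// work later and its response effectively goes nowhere. Elasticsearch logs such
// cancelled tasks with a 400 status, which is what made them look like real
// errors at first.
package main

import (
	"context"
	"log"
	"net/http"
	"net/http/httptest"
	"time"
)

func main() {
	server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(3 * time.Second) // response is "calculated" slower than the client is willing to wait
		_, _ = w.Write([]byte(`{"status":"green"}`))
	}))
	defer server.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 1*time.Second)
	defer cancel()

	req, _ := http.NewRequestWithContext(ctx, http.MethodGet, server.URL, nil)
	_, err := http.DefaultClient.Do(req)
	log.Printf("client result: %v", err) // context deadline exceeded: the client cancelled first
}
```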
The fact remains that we see longer unavailability during rolling upgrades in 8.3, which is still not fully understood. After a conversation with David Turner I don't have any new clues as to why that might be the case. The only contributing factor mentioned was that cluster state updates are faster now, but I need to follow up with him as to why that might lead to longer unavailability.
This has not happened in the last 6 months and we are a couple of minor versions ahead of 8.3 by now. I am closing this for the moment; we can reopen it if this becomes an issue again.