
TestMutationSecondMasterSetDown is flaky

Open • barkbay opened this issue 2 years ago • 4 comments

TestMutationSecondMasterSetDown failed 4 times today:

=== RUN   TestMutationSecondMasterSetDown/Elasticsearch_cluster_health_should_not_have_been_red_during_mutation_process
    steps_mutation.go:150: Elasticsearch cluster health check failure at 2022-07-06 18:12:25.389331963 +0000 UTC m=+5386.055151684: elasticsearch client failed for https://test-mutation-2nd-master-set-mqcr-es-internal-http.e2e-f79bs-mercury.svc:9200/_cluster/health: Get "https://test-mutation-2nd-master-set-mqcr-es-internal-http.e2e-f79bs-mercury.svc:9200/_cluster/health": context deadline exceeded
    steps_mutation.go:152: 
        	Error Trace:	steps_mutation.go:152
        	Error:      	Not equal: 
        	            	expected: 0
        	            	actual  : 1
        	Test:       	TestMutationSecondMasterSetDown/Elasticsearch_cluster_health_should_not_have_been_red_during_mutation_process
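
For context, the failing step is the continuous health check that runs for the whole duration of the mutation: it periodically calls _cluster/health and records every observation where the cluster was red or unreachable, and the final assertion expects that count to be zero. A minimal sketch of that mechanism, purely illustrative and not the actual code in steps_mutation.go (names, intervals and structure are assumptions):

    // Illustrative sketch of a continuous health check as used in the e2e tests;
    // names, intervals and structure are assumptions, not the real implementation.
    package healthcheck

    import (
        "context"
        "sync/atomic"
        "time"
    )

    type ContinuousHealthCheck struct {
        Failures int64 // number of red/unreachable observations
        stop     chan struct{}
    }

    // Start polls the cluster health until Stop is called and counts every observation
    // where the cluster was red or could not be reached within the request timeout.
    func (c *ContinuousHealthCheck) Start(getHealth func(ctx context.Context) (string, error)) {
        c.stop = make(chan struct{})
        go func() {
            ticker := time.NewTicker(3 * time.Second) // polling interval (assumed)
            defer ticker.Stop()
            for {
                select {
                case <-c.stop:
                    return
                case <-ticker.C:
                    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
                    status, err := getHealth(ctx)
                    cancel()
                    if err != nil || status == "red" {
                        atomic.AddInt64(&c.Failures, 1) // a timeout counts just like a red cluster
                    }
                }
            }
        }()
    }

    func (c *ContinuousHealthCheck) Stop() { close(c.stop) }

The assertion at steps_mutation.go:152 then simply requires the failure count to be zero, which is why a single timed-out _cluster/health call (expected: 0, actual: 1) fails the whole test.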

barkbay • Jul 07 '22 07:07

It also failed yesterday (https://devops-ci.elastic.co/job/cloud-on-k8s-e2e-tests-main/384/testReport/). The most notable recent change is #5844.

barkbay • Jul 07 '22 07:07

I took a look today and tried to figure out why this is failing. Just dropping my raw notes here for this failed run

https://devops-ci.elastic.co/job/cloud-on-k8s-e2e-tests-gke-k8s-versions/771//testReport

  • Jul 8, 2022 @ 04:57:07.135 master-1 shutdown requested
  • Jul 8, 2022 @ 04:57:08.866 Could not verify license, re-queuing: elasticsearch client failed for https://test-mutation-2nd-master-set-dsw2-es-internal-http.e2e-0gspi-mercury.svc:9200/_license: Get "https://test-mutation-2nd-master-set-dsw2-es-internal-http.e2e-0gspi-mercury.svc:9200/_license": dial tcp 10.48.24.6:9200: connect: connection refused
  • Jul 8, 2022 @ 04:57:19.887 operator deletes the master-1 PVC, i.e. it has observed the deletion of the Pod
  • Jul 8, 2022 @ 04:57:19.795 ES Log: nodes: {test-mutation-2nd-master-set-dsw2-es-masterdata-1}{_aB8czzjRum7U6GOficTzw}{CF3GtulDS6OSZaPqKdwoaA}{test-mutation-2nd-master-set-dsw2-es-masterdata-1}{10.48.34.94}{10.48.34.94:9300}{cdfhilmrstw}{k8s_node_name=gke-eck-gke20-771-e2e-default-pool-a3190d42-kmns, ml.machine_memory=2147483648, xpack.installed=true, ml.max_jvm_size=1073741824} {test-mutation-2nd-master-set-dsw2-es-master-0}{CnnhcSE2TjSpBCUDKKTEfg}{AkmaTK-NQXm8wjMhNEUyGA}{test-mutation-2nd-master-set-dsw2-es-master-0}{10.48.32.114}{10.48.32.114:9300}{m}{xpack.installed=true, k8s_node_name=gke-eck-gke20-771-e2e-default-pool-3efcb5bd-cvv0}, local, master {test-mutation-2nd-master-set-dsw2-es-masterdata-0}{ugdXiHyqSbSVKyAzldk3kw}{JvsuytbxQoyPwNuCNZR4NQ}{test-mutation-2nd-master-set-dsw2-es-masterdata-0}{10.48.33.145}{10.48.33.145:9300}{cdfhilmrstw}{k8s_node_name=gke-eck-gke20-771-e2e-default-pool-f72a6432-v926, ml.machine_memory=2147483648, xpack.installed=true, ml.max_jvm_size=1073741824}
  • Jul 8, 2022 @ 04:57:20.026 master-0 shutdown requested
  • Jul 8, 2022 @ 04:57:20.157 shutdown complete; operator scales down the StatefulSet
  • Jul 8, 2022 @ 04:57:20.478 master node changed {previous [{test-mutation-2nd-master-set-dsw2-es-master-0}{CnnhcSE2TjSpBCUDKKTEfg}{AkmaTK-NQXm8wjMhNEUyGA}{test-mutation-2nd-master-set-dsw2-es-master-0}{10.48.32.114}{10.48.32.114:9300}{m}], current []}, term: 3, version: 68, reason: becoming
  • Jul 8, 2022 @ 04:57:20.665 failing [elected-as-master ([3] nodes joined)[FINISH_ELECTION, {test-mutation-2nd-master-set-dsw2-es-masterdata-1}{_aB8czzjRum7U6GOficTzw}{CF3GtulDS6OSZaPqKdwoaA}{test-mutation-2nd-master-set-dsw2-es-masterdata-1}{10.48.34.94}{10.48.34.94:9300}{cdfhilmrstw} completing election, {test-mutation-2nd-master-set-dsw2-es-masterdata-0}{ugdXiHyqSbSVKyAzldk3kw}{JvsuytbxQoyPwNuCNZR4NQ}{test-mutation-2nd-master-set-dsw2-es-masterdata-0}{10.48.33.145}{10.48.33.145:9300}{cdfhilmrstw} completing election, {test-mutation-2nd-master-set-dsw2-es-master-0}{CnnhcSE2TjSpBCUDKKTEfg}{AkmaTK-NQXm8wjMhNEUyGA}{test-mutation-2nd-master-set-dsw2-es-master-0}{10.48.32.114}{10.48.32.114:9300}{m} completing election]]: failed to commit cluster state version [69]
  • Jul 8, 2022 @ 04:57:20.996 failing [elected-as-master ([2] nodes joined)[FINISH_ELECTION, {test-mutation-2nd-master-set-dsw2-es-masterdata-1}{_aB8czzjRum7U6GOficTzw}{CF3GtulDS6OSZaPqKdwoaA}{test-mutation-2nd-master-set-dsw2-es-masterdata-1}{10.48.34.94}{10.48.34.94:9300}{cdfhilmrstw} completing election, {test-mutation-2nd-master-set-dsw2-es-masterdata-0}{ugdXiHyqSbSVKyAzldk3kw}{JvsuytbxQoyPwNuCNZR4NQ}{test-mutation-2nd-master-set-dsw2-es-masterdata-0}{10.48.33.145}{10.48.33.145:9300}{cdfhilmrstw} completing election]]: failed to commit cluster state version [70]
  • Jul 8, 2022 @ 04:57:21.199 master node changed {previous [], current [{test-mutation-2nd-master-set-dsw2-es-masterdata-0}{ugdXiHyqSbSVKyAzldk3kw}{JvsuytbxQoyPwNuCNZR4NQ}{test-mutation-2nd-master-set-dsw2-es-masterdata-0}{10.48.33.145}{10.48.33.145:9300}{cdfhilmrstw}]}, term: 8, version: 70, reason: ApplyCommitRequest{term=8, version=70, sourceNode={test-mutation-2nd-master-set-dsw2-es-masterdata-0}{ugdXiHyqSbSVKyAzldk3kw}{JvsuytbxQoyPwNuCNZR4NQ}{test-mutation-2nd-master-set-dsw2-es-masterdata-0}{10.48.33.145}{10.48.33.145:9300}{cdfhilmrstw}{k8s_node_name=gke-eck-gke20-771-e2e-default-pool-f72a6432-v926, ml.machine_memory=2147483648, xpack.installed=true, ml.max_jvm_size=1073741824}}
  • Jul 8, 2022 @ 04:57:42.066 test-mutation-2nd-master-set-dsw2-es-master-0 handshake to [10.48.33.145:9300] failed
  • Jul 8, 2022 @ 04:57:43.715 test-mutation-2nd-master-set-dsw2-es-master-0 readiness probe failed
  • Jul 8, 2022 @ 04:57:47.064 test-mutation-2nd-master-set-dsw2-es-master-0 handshake to [10.48.33.145:9300] failed (same time as the failed e2e health check; did we maybe hit master-0?)
  • Jul 8, 2022 @ 04:57:43.030 test-mutation-2nd-master-set-dsw2-es-master-0 : master not discovered or elected yet, an election requires a node with id [_aB8czzjRum7U6GOficTzw], have only discovered non-quorum [{test-mutation-2nd-master-set-dsw2-es-master-0}{CnnhcSE2TjSpBCUDKKTEfg}{AkmaTK-NQXm8wjMhNEUyGA}{test-mutation-2nd-master-set-dsw2-es-master-0}{10.48.32.114}{10.48.32.114:9300}{m}]; discovery will continue using [127.0.0.1:9300, 127.0.0.1:9301, 127.0.0.1:9302, 127.0.0.1:9303, 127.0.0.1:9304, 127.0.0.1:9305, 10.48.33.145:9300, 10.48.33.146:9300, 10.48.34.94:9300] from hosts providers and [{test-mutation-2nd-master-set-dsw2-es-master-0}{CnnhcSE2TjSpBCUDKKTEfg}{AkmaTK-NQXm8wjMhNEUyGA}{test-mutation-2nd-master-set-dsw2-es-master-0}{10.48.32.114}{10.48.32.114:9300}{m}, {test-mutation-2nd-master-set-dsw2-es-masterdata-1}{_aB8czzjRum7U6GOficTzw}{CF3GtulDS6OSZaPqKdwoaA}{test-mutation-2nd-master-set-dsw2-es-masterdata-1}{10.48.34.94}{10.48.34.94:9300}{cdfhilmrstw}, {test-mutation-2nd-master-set-dsw2-es-masterdata-0}{ugdXiHyqSbSVKyAzldk3kw}{JvsuytbxQoyPwNuCNZR4NQ}{test-mutation-2nd-master-set-dsw2-es-masterdata-0}{10.48.33.145}{10.48.33.145:9300}{cdfhilmrstw}] from last-known cluster state; node term 8, last-accepted version 70 in term 8; joining [{test-mutation-2nd-master-set-dsw2-es-masterdata-0}{ugdXiHyqSbSVKyAzldk3kw}{JvsuytbxQoyPwNuCNZR4NQ}{test-mutation-2nd-master-set-dsw2-es-masterdata-0}{10.48.33.145}{10.48.33.145:9300}{cdfhilmrstw}] in term [8] has status [waiting for local cluster applier] after [22s/22010ms]
  • Jul 8, 2022 @ 04:57:50.567 nodes: {test-mutation-2nd-master-set-dsw2-es-masterdata-0}{ugdXiHyqSbSVKyAzldk3kw}{JvsuytbxQoyPwNuCNZR4NQ}{test-mutation-2nd-master-set-dsw2-es-masterdata-0}{10.48.33.145}{10.48.33.145:9300}{cdfhilmrstw}{k8s_node_name=gke-eck-gke20-771-e2e-default-pool-f72a6432-v926, xpack.installed=true, ml.max_jvm_size=1073741824, ml.machine_memory=2147483648}, local, master {test-mutation-2nd-master-set-dsw2-es-masterdata-1}{_aB8czzjRum7U6GOficTzw}{CF3GtulDS6OSZaPqKdwoaA}{test-mutation-2nd-master-set-dsw2-es-masterdata-1}{10.48.34.94}{10.48.34.94:9300}{cdfhilmrstw}{k8s_node_name=gke-eck-gke20-771-e2e-default-pool-a3190d42-kmns, ml.machine_memory=2147483648, xpack.installed=true, ml.max_jvm_size=1073741824}
  • Jul 8, 2022 @ 04:57:54.734 Pruning keys from certificates secret: test-mutation-2nd-master-set-dsw2-es-master-0.tls.crt, test-mutation-2nd-master-set-dsw2-es-master-0.tls.key (operator has observed the deletion at this point)
  • "2022-07-08T04:58:02.101Z" : Elasticsearch cluster health check failure at 2022-07-08 04:57:47.352919486 +0000 UTC m=+5352.139371606: elasticsearch client failed for https://test-mutation-2nd-master-set-dsw2-es-internal-http.e2e-0gspi-mercury.svc:9200/_cluster/health: Get "https://test-mutation-2nd-master-set-dsw2-es-internal-http.e2e-0gspi-mercury.svc:9200/_cluster/health": context deadline exceeded

If we compare that with a successful run, what I see is that in the successful run the node that ends up as the only surviving one for this test case (masterdata-0) was already the elected master. That means that no new master election was necessary during the scale-down of the other nodes.

My impression is that in the failed run the connection error happens while the current master is being shut down. I am not familiar enough with the details of how master elections work in Elasticsearch, but it seems that the first election attempts somehow fail (I am uncertain how significant that is), yet a second later we have a new master node.

What does not quite add up for me is that the failed health check comes much later, ~20 seconds after the new master is elected. The master-0 Pod still exists at that point, and I am wondering whether we hit the decommissioned Pod, which no longer has a master and is no longer part of the cluster (see the handshake failures). I always assumed that a Pod is removed from the Service endpoints once its deletion timestamp is set, but obviously there might be delays until this propagates through the k8s cluster.
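
One way to check that part of the theory would be to look at the Service's Endpoints while the old master Pod is terminating and see whether its IP is still listed as ready. A minimal client-go sketch; the namespace, Service name and Pod IP are taken from the failed run above, everything else is illustrative:

    // Sketch only: checks whether a (terminating) Pod IP is still a ready endpoint of the
    // internal HTTP Service that the e2e health check dials. Assumes in-cluster configuration.
    package main

    import (
        "context"
        "fmt"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/rest"
    )

    func main() {
        cfg, err := rest.InClusterConfig()
        if err != nil {
            panic(err)
        }
        client := kubernetes.NewForConfigOrDie(cfg)

        const (
            namespace = "e2e-0gspi-mercury"                                  // from the failed run
            service   = "test-mutation-2nd-master-set-dsw2-es-internal-http" // Service behind the health check URL
            podIP     = "10.48.32.114"                                       // master-0's IP from the logs
        )

        ep, err := client.CoreV1().Endpoints(namespace).Get(context.Background(), service, metav1.GetOptions{})
        if err != nil {
            panic(err)
        }
        for _, subset := range ep.Subsets {
            for _, addr := range subset.Addresses { // ready addresses only, notReadyAddresses are skipped
                if addr.IP == podIP {
                    fmt.Println("terminating Pod is still a ready endpoint of the Service")
                }
            }
        }
    }

If the endpoint removal lags behind the deletion timestamp, there is a window in which the Service can still route new connections to the Pod that is being removed.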

Finally, there is another failed connection attempt (04:57:08.866), which the operator sees much earlier but the e2e test health check misses.

pebrc • Jul 08 '22 16:07

The request tracing added in https://github.com/elastic/cloud-on-k8s/pull/5869 gives some additional clues.

Based on a recent test failure

 Elasticsearch cluster health check failure at 2022-07-12 03:38:17.428219659 +0000 UTC m=+3641.730346691: elasticsearch client failed for https://test-mutation-2nd-master-set-jdjv-es-internal-http.e2e-5zf22-mercury.svc:9200/_cluster/health: Get "https://test-mutation-2nd-master-set-jdjv-es-internal-http.e2e-5zf22-mercury.svc:9200/_cluster/health": context deadline exceeded

Corresponding request in the Elasticsearch logs: [screenshot: Screenshot 2022-07-12 at 10 00 07]

Observations:

  1. we are not hitting the node that is shutting down (master-1) but the node that will eventually be the only surviving one, so my theory was wrong (good!)
  2. 10.244.3.3 is the operator Pod, so this request is definitely coming from the e2e test Pod

Matching response:

[screenshot: Screenshot 2022-07-12 at 09 32 02]

So this is .226 seconds short of 5 seconds to reply, and 5 seconds is our timeout for the e2e test's continuous health check. So my new theory is that, once the network transit on top of that is accounted for, we exceed the 5-second timeout and thus fail the test.

But this of course happens on top of the 20 seconds of unavailability that we allow for anyway during a rolling upgrade.
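
In other words, the 5-second budget is a per-request context deadline on the client side and covers the whole round trip, not just the time Elasticsearch spends computing the response. A sketch of how such a check surfaces the error seen in the test output (illustrative, not the actual e2e client; the URL would be the _cluster/health endpoint from the failure above):

    // Sketch of the client-side deadline discussed above; not the actual e2e test client.
    package healthcheck

    import (
        "context"
        "fmt"
        "net/http"
        "time"
    )

    func checkClusterHealth(client *http.Client, url string) error {
        // 5s covers connection setup, TLS handshake, request and response
        ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
        defer cancel()

        req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
        if err != nil {
            return err
        }
        resp, err := client.Do(req)
        if err != nil {
            // a response that takes ~4.8s server-side can still end up here as
            // `Get "...": context deadline exceeded` once network transit is added
            return fmt.Errorf("elasticsearch client failed for %s: %w", url, err)
        }
        defer resp.Body.Close()
        return nil
    }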

What is interesting when looking at some of the other requests leading up to this failure is that we seem to be getting BAD_REQUEST responses back on _cluster/health requests:

[23][null][BAD_REQUEST][application/json;charset=utf-8][351] sent response to [Netty4HttpChannel{localAddress=/10.244.3.206:9200, remoteAddress=/10.244.2.3:46366}] success [true]

pebrc • Jul 12 '22 07:07

The 400 BAD_REQUEST log entries happen when the client cancels the request before the response has been computed. Cancelled tasks that still produce an HTTP result apparently default to 400 in Elasticsearch. So this was a bit of a red herring.

The fact remains that we see longer unavailability during rolling upgrades in 8.3, which is still not fully understood. After a conversation with David Turner I don't have any new clues as to why that might be the case. The only contributing factor mentioned was that cluster state updates are faster now, but I need to follow up with him on why that might lead to longer unavailability.

pebrc • Jul 14 '22 08:07

This has not happened in the last 6 months, and we are a couple of minor versions ahead of 8.3 by now. I am closing this for the moment; we can reopen if it becomes an issue again.

pebrc • Jan 19 '23 09:01