TestAgentVersionUpgradeToLatest8x is failing
TestAgentVersionUpgradeToLatest8x failed 3 times recently:
- 2022-01-24T02:37:22.702Z | cloud-on-k8s-e2e-tests-snapshot-versions/499
- 2022-02-07T02:27:38.791Z | cloud-on-k8s-e2e-tests-snapshot-versions/509
- 2022-02-08T02:44:08.234Z | cloud-on-k8s-e2e-tests-gke-k8s-versions/649
The cluster health is reported as red during the upgrade from the current stack version to the latest 8.x. The diagnostics show that the final mutated Elasticsearch cluster is green.
=== RUN TestAgentVersionUpgradeToLatest8x/Elasticsearch_cluster_health_should_not_have_been_red_during_mutation_process
steps_mutation.go:128:
Error Trace: steps_mutation.go:128
Error: Not equal:
expected: 0
actual : 18
Test: TestAgentVersionUpgradeToLatest8x/Elasticsearch_cluster_health_should_not_have_been_red_during_mutation_process
steps_mutation.go:130: Elasticsearch cluster health check failure at 2022-02-08 02:42:49.130649484 +0000 UTC m=+3063.771492454: cluster health red
steps_mutation.go:130: Elasticsearch cluster health check failure at 2022-02-08 02:42:52.129691969 +0000 UTC m=+3066.770534965: cluster health red
steps_mutation.go:130: Elasticsearch cluster health check failure at 2022-02-08 02:42:55.120825189 +0000 UTC m=+3069.761668155: cluster health red
steps_mutation.go:130: Elasticsearch cluster health check failure at 2022-02-08 02:42:58.122794104 +0000 UTC m=+3072.763637116: cluster health red
steps_mutation.go:130: Elasticsearch cluster health check failure at 2022-02-08 02:43:01.121855902 +0000 UTC m=+3075.762698901: cluster health red
steps_mutation.go:130: Elasticsearch cluster health check failure at 2022-02-08 02:43:04.114444511 +0000 UTC m=+3078.755287507: cluster health red
steps_mutation.go:130: Elasticsearch cluster health check failure at 2022-02-08 02:43:07.129480814 +0000 UTC m=+3081.770323807: cluster health red
steps_mutation.go:130: Elasticsearch cluster health check failure at 2022-02-08 02:43:10.119810916 +0000 UTC m=+3084.760653896: cluster health red
steps_mutation.go:130: Elasticsearch cluster health check failure at 2022-02-08 02:43:13.117716818 +0000 UTC m=+3087.758559821: cluster health red
steps_mutation.go:130: Elasticsearch cluster health check failure at 2022-02-08 02:43:16.116785822 +0000 UTC m=+3090.757628794: cluster health red
steps_mutation.go:130: Elasticsearch cluster health check failure at 2022-02-08 02:43:19.122187574 +0000 UTC m=+3093.763030560: cluster health red
steps_mutation.go:130: Elasticsearch cluster health check failure at 2022-02-08 02:43:22.116400073 +0000 UTC m=+3096.757243068: cluster health red
steps_mutation.go:130: Elasticsearch cluster health check failure at 2022-02-08 02:43:25.12096485 +0000 UTC m=+3099.761807834: cluster health red
steps_mutation.go:130: Elasticsearch cluster health check failure at 2022-02-08 02:43:28.115874777 +0000 UTC m=+3102.756717772: cluster health red
steps_mutation.go:130: Elasticsearch cluster health check failure at 2022-02-08 02:43:31.11987807 +0000 UTC m=+3105.760721076: cluster health red
steps_mutation.go:130: Elasticsearch cluster health check failure at 2022-02-08 02:43:34.122381513 +0000 UTC m=+3108.763224530: cluster health red
steps_mutation.go:130: Elasticsearch cluster health check failure at 2022-02-08 02:43:37.123677675 +0000 UTC m=+3111.764520657: cluster health red
steps_mutation.go:130: Elasticsearch cluster health check failure at 2022-02-08 02:43:40.138094179 +0000 UTC m=+3114.778937181: cluster health red
{"log.level":"error","@timestamp":"2022-02-08T02:44:08.234Z","message":"stopping early","service.version":"0.0.0-SNAPSHOT+00000000","service.type":"eck","ecs.version":"1.4.0","error":"test failure","error.stack_trace":"github.com/elastic/cloud-on-k8s/test/e2e/test.StepList.RunSequential\n\t/go/src/github.com/elastic/cloud-on-k8s/test/e2e/test/step.go:44\ngithub.com/elastic/cloud-on-k8s/test/e2e/test.RunMutations\n\t/go/src/github.com/elastic/cloud-on-k8s/test/e2e/test/run_mutation.go:39\ngithub.com/elastic/cloud-on-k8s/test/e2e/agent.TestAgentVersionUpgradeToLatest8x\n\t/go/src/github.com/elastic/cloud-on-k8s/test/e2e/agent/upgrade_test.go:62\ntesting.tRunner\n\t/usr/local/go/src/testing/testing.go:1259"}
--- FAIL: TestAgentVersionUpgradeToLatest8x/Elasticsearch_cluster_health_should_not_have_been_red_during_mutation_process (0.00s)
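For context, this step counts how many times the e2e test observed a red cluster health while the mutation was running (polled roughly every 3 seconds, per the timestamps above) and expects that count to be 0; here it was 18. Below is a minimal sketch of that kind of continuous watcher, with hypothetical names, endpoint and plumbing, not the actual steps_mutation.go implementation:

```go
// Hypothetical sketch of a continuous health watcher in the spirit of the failing
// e2e step: poll _cluster/health while the mutation runs and record every time the
// cluster is reported red. Names, endpoint and plumbing are assumptions.
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

type continuousHealthCheck struct {
	endpoint string
	client   *http.Client
	failures []time.Time // timestamps at which the cluster was observed red
}

// Start polls the cluster health every interval until ctx is cancelled.
func (c *continuousHealthCheck) Start(ctx context.Context, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			resp, err := c.client.Get(c.endpoint + "/_cluster/health")
			if err != nil {
				continue // network errors are handled separately in the real test
			}
			var health struct {
				Status string `json:"status"`
			}
			decodeErr := json.NewDecoder(resp.Body).Decode(&health)
			resp.Body.Close()
			if decodeErr == nil && health.Status == "red" {
				c.failures = append(c.failures, time.Now())
			}
		}
	}
}

func main() {
	check := &continuousHealthCheck{
		endpoint: "http://localhost:9200", // hypothetical ES endpoint
		client:   &http.Client{Timeout: 2 * time.Second},
	}

	// Watch the health for the duration of the (simulated) mutation.
	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()

	done := make(chan struct{})
	go func() {
		check.Start(ctx, 3*time.Second)
		close(done)
	}()
	<-done

	// The test step then asserts that no red health was ever observed.
	fmt.Printf("red observations during mutation: %d (expected 0)\n", len(check.failures))
}
```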
Looking at the container start dates, the failure seems to happen when the second node to be restarted, masterdata-1, is restarted.
# e2e-ruqcg-mercury/pods.json
test-agent-upgrade-md9p-es-masterdata-2 started at 2022-02-08T02:38:57Z
test-agent-upgrade-md9p-es-masterdata-1 started at 2022-02-08T02:41:09Z
test-agent-upgrade-md9p-es-masterdata-0 started at 2022-02-08T02:43:21Z
About one minute before the cluster health is reported as red, there is an error logged three times ~indicating that the operator used the v6 client to request the status~ (yes, because this is the default implementation shared with the v7 client) and that it failed to parse the response. Probably not the root cause, but still something worth investigating.
{
"log.level": "error",
"@timestamp": "2022-02-08T02:41:33.348Z",
"log.logger": "elasticsearch-client",
"message": "Cannot parse Elasticsearch error response body",
"service.version": "2.1.0-SNAPSHOT+65999b31",
"service.type": "eck",
"ecs.version": "1.4.0",
"error": "json: cannot unmarshal string into Go struct field ErrorResponse.status of type int",
"error.stack_trace": "github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/client.newAPIError\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/client/error.go:46\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/client.(*baseClient).doRequest\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/client/base.go:88\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/client.(*baseClient).request\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/client/base.go:144\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/client.(*clientV6).GetClusterHealthWaitForAllEvents\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/client/v6.go:95\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver.(*memoizingHealth).initialize\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver/esstate.go:139\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver.initOnce.func1\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver/esstate.go:49\nsync.(*Once).doSlow\n\t/usr/local/go/src/sync/once.go:68\nsync.(*Once).Do\n\t/usr/local/go/src/sync/once.go:59\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver.initOnce\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver/esstate.go:48\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver.(*memoizingHealth).Health\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver/esstate.go:149\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver.glob..func5\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver/upgrade_predicates.go:166\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver.runPredicates\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver/upgrade_pods_deletion.go:192\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver.applyPredicates\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver/upgrade_predicates.go:84\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver.(*rollingUpgradeCtx).Delete\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver/upgrade_pods_deletion.go:53\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver.rollingUpgradeCtx.run\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver/upgrade.go:161\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver.(*defaultDriver).handleRollingUpgrades\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver/upgrade.go:90\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver.(*defaultDriver).reconcileNodeSpecs\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver/nodes.go:200\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver.(*defaultDriver).Reconcile\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver/driver.go:308\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch.(*ReconcileElasticsearch).internalReconcile\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/elasticsearch_controller.go:285\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch.(*ReconcileElasticsearch).Reconcile\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/elasticsearch_controller.go:184
\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227"
}
The cluster went yellow at the beginning of the upgrade because some indices have unassigned replicas, but this is tolerated in this test, so it shouldn't be an issue.
{
"log.level": "info",
"@timestamp": "2022-02-08T02:41:33.348Z",
"log.logger": "driver",
"message": "Cannot restart some nodes for upgrade at this time",
"service.version": "2.1.0-SNAPSHOT+65999b31",
"service.type": "eck",
"ecs.version": "1.4.0",
"namespace": "e2e-ruqcg-mercury",
"es_name": "test-agent-upgrade-md9p",
"failed_predicates": {
"if_yellow_only_restart_upgrading_nodes_with_unassigned_replicas": [
"test-agent-upgrade-md9p-es-masterdata-0"
]
}
}
.ds-metrics-system.cpu-default-2022.02.08-000001 0 r UNASSIGNED
.ds-metrics-elastic_agent.metricbeat-default-2022.02.08-000001 0 r UNASSIGNED
.ds-logs-elastic_agent-default-2022.02.08-000001 0 r UNASSIGNED
.ds-logs-elastic_agent.metricbeat-default-2022.02.08-000001 0 r UNASSIGNED
.fleet-agents-7 0 r UNASSIGNED
.fleet-policies-7 0 r UNASSIGNED
.ds-logs-elastic_agent.filebeat-default-2022.02.08-000001 0 r UNASSIGNED
.ds-.logs-deprecation.elasticsearch-default-2022.02.08-000001 0 r UNASSIGNED
.ds-metrics-system.load-default-2022.02.08-000001 0 r UNASSIGNED
data-integrity-check 2 r UNASSIGNED
data-integrity-check 0 r UNASSIGNED
.ds-metrics-system.socket_summary-default-2022.02.08-000001 0 r UNASSIGNED
.kibana_security_session_1 0 r UNASSIGNED
.kibana_task_manager_7.17.0_001 0 r UNASSIGNED
.fleet-enrollment-api-keys-7 0 r UNASSIGNED
.kibana-event-log-7.17.0-000001 0 r UNASSIGNED
.fleet-servers-7 0 r UNASSIGNED
.apm-agent-configuration 0 r UNASSIGNED
.ds-metrics-system.diskio-default-2022.02.08-000001 0 r UNASSIGNED
.geoip_databases 0 r UNASSIGNED
.apm-custom-link 0 r UNASSIGNED
.ds-metrics-system.uptime-default-2022.02.08-000001 0 r UNASSIGNED
.ds-metrics-system.process-default-2022.02.08-000001 0 r UNASSIGNED
.ds-metrics-system.memory-default-2022.02.08-000001 0 r UNASSIGNED
I don't know how it can be red. I ran it locally a couple of times and couldn't reproduce it.
During the upgrade, I see in the operator log that it gets some 408s while retrieving _cluster/health and fails to parse the response body. I printed the body and it shows up in base64.
2022-02-09T17:50:13.700+0100 ERROR elasticsearch-client Cannot parse Elasticsearch error response body {"service.version": "2.1.0-SNAPSHOT+cfa61fb2", "body": "eyJjbHVzdGVyX25hbWUiOiJ0ZXN0LWFnZW50LXVwZ3JhZGUtbnZxOCIsInN0YXR1cyI6InllbGxvdyIsInRpbWVkX291dCI6dHJ1ZSwibnVtYmVyX29mX25vZGVzIjozLCJudW1iZXJfb2ZfZGF0YV9ub2Rlcy
I6MywiYWN0aXZlX3ByaW1hcnlfc2hhcmRzIjozNSwiYWN0aXZlX3NoYXJkcyI6NjQsInJlbG9jYXRpbmdfc2hhcmRzIjowLCJpbml0aWFsaXppbmdfc2hhcmRzIjoyLCJ1bmFzc2lnbmVkX3NoYXJkcyI6NCwiZGVsYXllZF91bmFzc2lnbmVkX3NoYXJkcyI6NCwibnVtYmVyX29mX3BlbmRpbmdfdGFza3MiOjEsIm51bWJlcl9vZl9pbl9mbGlnaHRfZmV0Y2giOjAsInRhc2tfbWF4X3dhaXRpbmdfaW5fcXVldWVfbWlsbGlzIjowLCJh
Y3RpdmVfc2hhcmRzX3BlcmNlbnRfYXNfbnVtYmVyIjo5MS40Mjg1NzE0Mjg1NzE0M30=", "error": "json: cannot unmarshal string into Go struct field ErrorResponse.status of type int"}
github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/client.newAPIError
/home/thb/dev/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/client/error.go:46
github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/client.(*baseClient).doRequest
/home/thb/dev/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/client/base.go:90
github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/client.(*baseClient).request
/home/thb/dev/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/client/base.go:146
github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/client.(*clientV6).GetClusterHealthWaitForAllEvents
/home/thb/dev/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/client/v6.go:95
github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver.(*memoizingHealth).initialize
/home/thb/dev/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver/esstate.go:139
github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver.initOnce.func1
/home/thb/dev/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver/esstate.go:49
sync.(*Once).doSlow
/home/thb/dev/tools/go/src/sync/once.go:68
sync.(*Once).Do
/home/thb/dev/tools/go/src/sync/once.go:59
echo 'eyJjbHVzdGVyX25hbWUiOiJ0ZXN0LWFnZW50LXVwZ3JhZGUtbnZxOCIsInN0YXR1cyI6InllbGxvdyIsInRpbWVkX291dCI6dHJ1ZSwibnVtYmVyX29mX25vZGVzIjozLCJudW1iZXJfb2ZfZGF0YV9ub2Rlcy
I6MywiYWN0aXZlX3ByaW1hcnlfc2hhcmRzIjozNSwiYWN0aXZlX3NoYXJkcyI6NjQsInJlbG9jYXRpbmdfc2hhcmRzIjowLCJpbml0aWFsaXppbmdfc2hhcmRzIjoyLCJ1bmFzc2lnbmVkX3NoYXJkcyI6NCwiZGVsYXllZF91bmFzc2lnbmVkX3NoYXJkcyI6NCwibnVtYmVyX29mX3BlbmRpbmdfdGFza3MiOjEsIm51bWJlcl9vZl9pbl9mbGlnaHRfZmV0Y2giOjAsInRhc2tfbWF4X3dhaXRpbmdfaW5fcXVldWVfbWlsbGlzIjowLCJh
Y3RpdmVfc2hhcmRzX3BlcmNlbnRfYXNfbnVtYmVyIjo5MS40Mjg1NzE0Mjg1NzE0M30=' | base64 -d | jq .
{
"cluster_name": "test-agent-upgrade-nvq8",
"status": "yellow",
"timed_out": true,
"number_of_nodes": 3,
"number_of_data_nodes": 3,
"active_primary_shards": 35,
"active_shards": 64,
"relocating_shards": 0,
"initializing_shards": 2,
"unassigned_shards": 4,
"delayed_unassigned_shards": 4,
"number_of_pending_tasks": 1,
"number_of_in_flight_fetch": 0,
"task_max_waiting_in_queue_millis": 0,
"active_shards_percent_as_number": 91.42857142857143
}
The last comment was me being stupid: I forgot to convert the []byte body to a string when I logged it, so the logging framework printed the []byte in base64. It accepts an interface{} and applies conversions like this for some types.
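That conversion is standard Go behavior rather than anything ECK-specific: when a []byte ends up JSON-encoded on the logging path, encoding/json renders it as a base64 string, whereas a plain string is printed verbatim. A minimal sketch of the difference (the map here is just a stand-in for the logger's key/value handling):

```go
// Minimal illustration of why the logged body came out as base64: encoding/json
// marshals a []byte as a base64-encoded string, while a plain string is kept as-is.
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	body := []byte(`{"status":"yellow","timed_out":true}`)

	asBytes, _ := json.Marshal(map[string]interface{}{"body": body})
	asString, _ := json.Marshal(map[string]interface{}{"body": string(body)})

	fmt.Println(string(asBytes))  // the []byte is rendered as a base64 string ("eyJz...")
	fmt.Println(string(asString)) // the string is rendered as readable (escaped) JSON
}
```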
For the cannot unmarshal string error: by default we try to parse the ES error response body into the ErrorResponse struct, and we expect this to fail for a 408 on /_cluster/health, because that response carries a normal cluster health body whose status field is a string ("yellow") rather than the numeric HTTP status the struct expects.
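This is easy to reproduce in isolation; a simplified sketch, assuming a stripped-down ErrorResponse with just the status field (the operator's real struct has more fields):

```go
// Simplified reproduction of the unmarshal error: the health body's "status"
// is a string ("yellow"), but the error-response struct expects an int there.
// This ErrorResponse is a stripped-down stand-in, not the operator's real struct.
package main

import (
	"encoding/json"
	"fmt"
)

type ErrorResponse struct {
	Status int `json:"status"` // HTTP status code in a real ES error body
}

func main() {
	healthBody := []byte(`{"cluster_name":"test-agent-upgrade-nvq8","status":"yellow","timed_out":true}`)

	var errResp ErrorResponse
	err := json.Unmarshal(healthBody, &errResp)
	fmt.Println(err)
	// json: cannot unmarshal string into Go struct field ErrorResponse.status of type int
}
```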
The only thing to retain is that the operator gets a few 408s when requesting /_cluster/health during the upgrade (the call waits for all pending cluster events to be processed and times out while nodes are restarting), which is expected:
https://github.com/elastic/cloud-on-k8s/blob/780831adab3c88e57666d7a1659a589a4a2b5d70/test/e2e/test/elasticsearch/steps_mutation.go#L226-L236
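For completeness, the 408s come from the wait semantics of the health call: GetClusterHealthWaitForAllEvents asks Elasticsearch to wait until pending cluster events are processed, and Elasticsearch answers with HTTP 408 and "timed_out": true when that doesn't happen within the timeout, which is frequent while nodes restart. A rough sketch under the assumption that the call uses wait_for_events with a short timeout (the exact parameters and endpoint are not copied from the operator):

```go
// Rough sketch of why these 408s show up: _cluster/health supports wait_for_*
// parameters, and Elasticsearch answers with HTTP 408 and "timed_out": true when
// the condition is not met within the timeout. The query parameters below are an
// assumption, not copied from GetClusterHealthWaitForAllEvents.
package main

import (
	"fmt"
	"net/http"
)

func main() {
	// Ask the cluster to wait until all pending events are processed, with a very
	// short timeout; during a rolling upgrade this frequently cannot complete in time.
	url := "http://localhost:9200/_cluster/health?wait_for_events=languid&timeout=0s"
	resp, err := http.Get(url)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	if resp.StatusCode == http.StatusRequestTimeout { // 408
		fmt.Println("health request timed out waiting for pending events (expected during the upgrade)")
	}
}
```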
OK, so this is not the exact same test (it was TestVersionUpgradeToLatest8x/Elasticsearch_cluster_health_should_not_have_been_red_during_mutation_process), but I think it might be the same problem. I am summarising what I saw in a new issue, which might actually be an Elasticsearch issue: https://github.com/elastic/cloud-on-k8s/issues/5795
Closing as similar to #5795.