TestAgentVersionUpgradeToLatest8x is failing
TestAgentVersionUpgradeToLatest8x failed 3 times recently:
- 2022-01-24T02:37:22.702Z | cloud-on-k8s-e2e-tests-snapshot-versions/499
- 2022-02-07T02:27:38.791Z | cloud-on-k8s-e2e-tests-snapshot-versions/509
- 2022-02-08T02:44:08.234Z | cloud-on-k8s-e2e-tests-gke-k8s-versions/649
The cluster health is reported as red during the upgrade from the current stack version to the latest 8.x. The diagnostics show that the final mutated Elasticsearch cluster is green.
=== RUN TestAgentVersionUpgradeToLatest8x/Elasticsearch_cluster_health_should_not_have_been_red_during_mutation_process
steps_mutation.go:128:
Error Trace: steps_mutation.go:128
Error: Not equal:
expected: 0
actual : 18
Test: TestAgentVersionUpgradeToLatest8x/Elasticsearch_cluster_health_should_not_have_been_red_during_mutation_process
steps_mutation.go:130: Elasticsearch cluster health check failure at 2022-02-08 02:42:49.130649484 +0000 UTC m=+3063.771492454: cluster health red
steps_mutation.go:130: Elasticsearch cluster health check failure at 2022-02-08 02:42:52.129691969 +0000 UTC m=+3066.770534965: cluster health red
steps_mutation.go:130: Elasticsearch cluster health check failure at 2022-02-08 02:42:55.120825189 +0000 UTC m=+3069.761668155: cluster health red
steps_mutation.go:130: Elasticsearch cluster health check failure at 2022-02-08 02:42:58.122794104 +0000 UTC m=+3072.763637116: cluster health red
steps_mutation.go:130: Elasticsearch cluster health check failure at 2022-02-08 02:43:01.121855902 +0000 UTC m=+3075.762698901: cluster health red
steps_mutation.go:130: Elasticsearch cluster health check failure at 2022-02-08 02:43:04.114444511 +0000 UTC m=+3078.755287507: cluster health red
steps_mutation.go:130: Elasticsearch cluster health check failure at 2022-02-08 02:43:07.129480814 +0000 UTC m=+3081.770323807: cluster health red
steps_mutation.go:130: Elasticsearch cluster health check failure at 2022-02-08 02:43:10.119810916 +0000 UTC m=+3084.760653896: cluster health red
steps_mutation.go:130: Elasticsearch cluster health check failure at 2022-02-08 02:43:13.117716818 +0000 UTC m=+3087.758559821: cluster health red
steps_mutation.go:130: Elasticsearch cluster health check failure at 2022-02-08 02:43:16.116785822 +0000 UTC m=+3090.757628794: cluster health red
steps_mutation.go:130: Elasticsearch cluster health check failure at 2022-02-08 02:43:19.122187574 +0000 UTC m=+3093.763030560: cluster health red
steps_mutation.go:130: Elasticsearch cluster health check failure at 2022-02-08 02:43:22.116400073 +0000 UTC m=+3096.757243068: cluster health red
steps_mutation.go:130: Elasticsearch cluster health check failure at 2022-02-08 02:43:25.12096485 +0000 UTC m=+3099.761807834: cluster health red
steps_mutation.go:130: Elasticsearch cluster health check failure at 2022-02-08 02:43:28.115874777 +0000 UTC m=+3102.756717772: cluster health red
steps_mutation.go:130: Elasticsearch cluster health check failure at 2022-02-08 02:43:31.11987807 +0000 UTC m=+3105.760721076: cluster health red
steps_mutation.go:130: Elasticsearch cluster health check failure at 2022-02-08 02:43:34.122381513 +0000 UTC m=+3108.763224530: cluster health red
steps_mutation.go:130: Elasticsearch cluster health check failure at 2022-02-08 02:43:37.123677675 +0000 UTC m=+3111.764520657: cluster health red
steps_mutation.go:130: Elasticsearch cluster health check failure at 2022-02-08 02:43:40.138094179 +0000 UTC m=+3114.778937181: cluster health red
{"log.level":"error","@timestamp":"2022-02-08T02:44:08.234Z","message":"stopping early","service.version":"0.0.0-SNAPSHOT+00000000","service.type":"eck","ecs.version":"1.4.0","error":"test failure","error.stack_trace":"github.com/elastic/cloud-on-k8s/test/e2e/test.StepList.RunSequential\n\t/go/src/github.com/elastic/cloud-on-k8s/test/e2e/test/step.go:44\ngithub.com/elastic/cloud-on-k8s/test/e2e/test.RunMutations\n\t/go/src/github.com/elastic/cloud-on-k8s/test/e2e/test/run_mutation.go:39\ngithub.com/elastic/cloud-on-k8s/test/e2e/agent.TestAgentVersionUpgradeToLatest8x\n\t/go/src/github.com/elastic/cloud-on-k8s/test/e2e/agent/upgrade_test.go:62\ntesting.tRunner\n\t/usr/local/go/src/testing/testing.go:1259"}
--- FAIL: TestAgentVersionUpgradeToLatest8x/Elasticsearch_cluster_health_should_not_have_been_red_during_mutation_process (0.00s)
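For context, this step counts how many times the e2e test observed a red cluster health while the mutation was running (polled roughly every 3 seconds, per the timestamps above) and expects that count to be 0; here it was 18. Below is a minimal sketch of that kind of continuous watcher, with hypothetical names, endpoint and plumbing, not the actual steps_mutation.go implementation:

```go
// Hypothetical sketch of a continuous health watcher in the spirit of the failing
// e2e step: poll _cluster/health while the mutation runs and record every time the
// cluster is reported red. Names, endpoint and plumbing are assumptions.
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

type continuousHealthCheck struct {
	endpoint string
	client   *http.Client
	failures []time.Time // timestamps at which the cluster was observed red
}

// Start polls the cluster health every interval until ctx is cancelled.
func (c *continuousHealthCheck) Start(ctx context.Context, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			resp, err := c.client.Get(c.endpoint + "/_cluster/health")
			if err != nil {
				continue // network errors are handled separately in the real test
			}
			var health struct {
				Status string `json:"status"`
			}
			decodeErr := json.NewDecoder(resp.Body).Decode(&health)
			resp.Body.Close()
			if decodeErr == nil && health.Status == "red" {
				c.failures = append(c.failures, time.Now())
			}
		}
	}
}

func main() {
	check := &continuousHealthCheck{
		endpoint: "http://localhost:9200", // hypothetical ES endpoint
		client:   &http.Client{Timeout: 2 * time.Second},
	}

	// Watch the health for the duration of the (simulated) mutation.
	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()

	done := make(chan struct{})
	go func() {
		check.Start(ctx, 3*time.Second)
		close(done)
	}()
	<-done

	// The test step then asserts that no red health was ever observed.
	fmt.Printf("red observations during mutation: %d (expected 0)\n", len(check.failures))
}
```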
Looking at the container start dates, the failure seems to happen when the second node to be restarted, masterdata-1, is restarted.
# e2e-ruqcg-mercury/pods.json
test-agent-upgrade-md9p-es-masterdata-2 started at 2022-02-08T02:38:57Z
test-agent-upgrade-md9p-es-masterdata-1 started at 2022-02-08T02:41:09Z
test-agent-upgrade-md9p-es-masterdata-0 started at 2022-02-08T02:43:21Z
About one minute before the cluster health is reported as red, there is an error logged three times ~indicating that the operator used the v6 client to request the status~ (yes, because this is the default implementation shared with the v7 client) and that it failed to parse the response. Probably not the root cause, but still something worth investigating.
{
"log.level": "error",
"@timestamp": "2022-02-08T02:41:33.348Z",
"log.logger": "elasticsearch-client",
"message": "Cannot parse Elasticsearch error response body",
"service.version": "2.1.0-SNAPSHOT+65999b31",
"service.type": "eck",
"ecs.version": "1.4.0",
"error": "json: cannot unmarshal string into Go struct field ErrorResponse.status of type int",
"error.stack_trace": "github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/client.newAPIError\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/client/error.go:46\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/client.(*baseClient).doRequest\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/client/base.go:88\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/client.(*baseClient).request\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/client/base.go:144\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/client.(*clientV6).GetClusterHealthWaitForAllEvents\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/client/v6.go:95\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver.(*memoizingHealth).initialize\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver/esstate.go:139\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver.initOnce.func1\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver/esstate.go:49\nsync.(*Once).doSlow\n\t/usr/local/go/src/sync/once.go:68\nsync.(*Once).Do\n\t/usr/local/go/src/sync/once.go:59\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver.initOnce\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver/esstate.go:48\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver.(*memoizingHealth).Health\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver/esstate.go:149\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver.glob..func5\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver/upgrade_predicates.go:166\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver.runPredicates\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver/upgrade_pods_deletion.go:192\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver.applyPredicates\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver/upgrade_predicates.go:84\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver.(*rollingUpgradeCtx).Delete\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver/upgrade_pods_deletion.go:53\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver.rollingUpgradeCtx.run\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver/upgrade.go:161\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver.(*defaultDriver).handleRollingUpgrades\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver/upgrade.go:90\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver.(*defaultDriver).reconcileNodeSpecs\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver/nodes.go:200\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver.(*defaultDriver).Reconcile\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver/driver.go:308\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch.(*ReconcileElasticsearch).internalReconcile\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/elasticsearch_controller.go:285\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch.(*ReconcileElasticsearch).Reconcile\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/elasticsearch_controller.go:184
\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227"
}
The cluster went yellow at the beginning of the upgrade because some indices have unassigned replicas, but this is tolerated in this test, so it shouldn't be an issue.
{
"log.level": "info",
"@timestamp": "2022-02-08T02:41:33.348Z",
"log.logger": "driver",
"message": "Cannot restart some nodes for upgrade at this time",
"service.version": "2.1.0-SNAPSHOT+65999b31",
"service.type": "eck",
"ecs.version": "1.4.0",
"namespace": "e2e-ruqcg-mercury",
"es_name": "test-agent-upgrade-md9p",
"failed_predicates": {
"if_yellow_only_restart_upgrading_nodes_with_unassigned_replicas": [
"test-agent-upgrade-md9p-es-masterdata-0"
]
}
}
.ds-metrics-system.cpu-default-2022.02.08-000001 0 r UNASSIGNED
.ds-metrics-elastic_agent.metricbeat-default-2022.02.08-000001 0 r UNASSIGNED
.ds-logs-elastic_agent-default-2022.02.08-000001 0 r UNASSIGNED
.ds-logs-elastic_agent.metricbeat-default-2022.02.08-000001 0 r UNASSIGNED
.fleet-agents-7 0 r UNASSIGNED
.fleet-policies-7 0 r UNASSIGNED
.ds-logs-elastic_agent.filebeat-default-2022.02.08-000001 0 r UNASSIGNED
.ds-.logs-deprecation.elasticsearch-default-2022.02.08-000001 0 r UNASSIGNED
.ds-metrics-system.load-default-2022.02.08-000001 0 r UNASSIGNED
data-integrity-check 2 r UNASSIGNED
data-integrity-check 0 r UNASSIGNED
.ds-metrics-system.socket_summary-default-2022.02.08-000001 0 r UNASSIGNED
.kibana_security_session_1 0 r UNASSIGNED
.kibana_task_manager_7.17.0_001 0 r UNASSIGNED
.fleet-enrollment-api-keys-7 0 r UNASSIGNED
.kibana-event-log-7.17.0-000001 0 r UNASSIGNED
.fleet-servers-7 0 r UNASSIGNED
.apm-agent-configuration 0 r UNASSIGNED
.ds-metrics-system.diskio-default-2022.02.08-000001 0 r UNASSIGNED
.geoip_databases 0 r UNASSIGNED
.apm-custom-link 0 r UNASSIGNED
.ds-metrics-system.uptime-default-2022.02.08-000001 0 r UNASSIGNED
.ds-metrics-system.process-default-2022.02.08-000001 0 r UNASSIGNED
.ds-metrics-system.memory-default-2022.02.08-000001 0 r UNASSIGNED
I don't know how it can be red. I ran it locally a couple of times and couldn't reproduce it.
During the upgrade, I see in the operator log that it gets some 408s while retrieving _cluster/health and fails to parse the response body. I printed the body and it shows up in base64.
2022-02-09T17:50:13.700+0100 ERROR elasticsearch-client Cannot parse Elasticsearch error response body {"service.version": "2.1.0-SNAPSHOT+cfa61fb2", "body": "eyJjbHVzdGVyX25hbWUiOiJ0ZXN0LWFnZW50LXVwZ3JhZGUtbnZxOCIsInN0YXR1cyI6InllbGxvdyIsInRpbWVkX291dCI6dHJ1ZSwibnVtYmVyX29mX25vZGVzIjozLCJudW1iZXJfb2ZfZGF0YV9ub2Rlcy
I6MywiYWN0aXZlX3ByaW1hcnlfc2hhcmRzIjozNSwiYWN0aXZlX3NoYXJkcyI6NjQsInJlbG9jYXRpbmdfc2hhcmRzIjowLCJpbml0aWFsaXppbmdfc2hhcmRzIjoyLCJ1bmFzc2lnbmVkX3NoYXJkcyI6NCwiZGVsYXllZF91bmFzc2lnbmVkX3NoYXJkcyI6NCwibnVtYmVyX29mX3BlbmRpbmdfdGFza3MiOjEsIm51bWJlcl9vZl9pbl9mbGlnaHRfZmV0Y2giOjAsInRhc2tfbWF4X3dhaXRpbmdfaW5fcXVldWVfbWlsbGlzIjowLCJh
Y3RpdmVfc2hhcmRzX3BlcmNlbnRfYXNfbnVtYmVyIjo5MS40Mjg1NzE0Mjg1NzE0M30=", "error": "json: cannot unmarshal string into Go struct field ErrorResponse.status of type int"}
github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/client.newAPIError
/home/thb/dev/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/client/error.go:46
github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/client.(*baseClient).doRequest
/home/thb/dev/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/client/base.go:90
github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/client.(*baseClient).request
/home/thb/dev/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/client/base.go:146
github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/client.(*clientV6).GetClusterHealthWaitForAllEvents
/home/thb/dev/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/client/v6.go:95
github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver.(*memoizingHealth).initialize
/home/thb/dev/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver/esstate.go:139
github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver.initOnce.func1
/home/thb/dev/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver/esstate.go:49
sync.(*Once).doSlow
/home/thb/dev/tools/go/src/sync/once.go:68
sync.(*Once).Do
/home/thb/dev/tools/go/src/sync/once.go:59
echo 'eyJjbHVzdGVyX25hbWUiOiJ0ZXN0LWFnZW50LXVwZ3JhZGUtbnZxOCIsInN0YXR1cyI6InllbGxvdyIsInRpbWVkX291dCI6dHJ1ZSwibnVtYmVyX29mX25vZGVzIjozLCJudW1iZXJfb2ZfZGF0YV9ub2Rlcy
I6MywiYWN0aXZlX3ByaW1hcnlfc2hhcmRzIjozNSwiYWN0aXZlX3NoYXJkcyI6NjQsInJlbG9jYXRpbmdfc2hhcmRzIjowLCJpbml0aWFsaXppbmdfc2hhcmRzIjoyLCJ1bmFzc2lnbmVkX3NoYXJkcyI6NCwiZGVsYXllZF91bmFzc2lnbmVkX3NoYXJkcyI6NCwibnVtYmVyX29mX3BlbmRpbmdfdGFza3MiOjEsIm51bWJlcl9vZl9pbl9mbGlnaHRfZmV0Y2giOjAsInRhc2tfbWF4X3dhaXRpbmdfaW5fcXVldWVfbWlsbGlzIjowLCJh
Y3RpdmVfc2hhcmRzX3BlcmNlbnRfYXNfbnVtYmVyIjo5MS40Mjg1NzE0Mjg1NzE0M30=' | base64 -d | jq .
{
"cluster_name": "test-agent-upgrade-nvq8",
"status": "yellow",
"timed_out": true,
"number_of_nodes": 3,
"number_of_data_nodes": 3,
"active_primary_shards": 35,
"active_shards": 64,
"relocating_shards": 0,
"initializing_shards": 2,
"unassigned_shards": 4,
"delayed_unassigned_shards": 4,
"number_of_pending_tasks": 1,
"number_of_in_flight_fetch": 0,
"task_max_waiting_in_queue_millis": 0,
"active_shards_percent_as_number": 91.42857142857143
}
The last comment was me being stupid: I forgot to convert the []byte body to a string when I logged it, so the logging framework printed the []byte in base64. It accepts an interface{} and applies conversions like this for some types.
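That conversion is standard Go behavior rather than anything ECK-specific: when a []byte ends up JSON-encoded on the logging path, encoding/json renders it as a base64 string, whereas a plain string is printed verbatim. A minimal sketch of the difference (the map here is just a stand-in for the logger's key/value handling):

```go
// Minimal illustration of why the logged body came out as base64: encoding/json
// marshals a []byte as a base64-encoded string, while a plain string is kept as-is.
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	body := []byte(`{"status":"yellow","timed_out":true}`)

	asBytes, _ := json.Marshal(map[string]interface{}{"body": body})
	asString, _ := json.Marshal(map[string]interface{}{"body": string(body)})

	fmt.Println(string(asBytes))  // the []byte is rendered as a base64 string ("eyJz...")
	fmt.Println(string(asString)) // the string is rendered as readable (escaped) JSON
}
```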
For the cannot unmarshal string error: by default we try to parse the ES error response body into the ErrorResponse struct, and we expect this to fail for a 408 on /_cluster/health, because that response carries a normal cluster health body whose status field is a string ("yellow") rather than the numeric HTTP status the struct expects.
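This is easy to reproduce in isolation; a simplified sketch, assuming a stripped-down ErrorResponse with just the status field (the operator's real struct has more fields):

```go
// Simplified reproduction of the unmarshal error: the health body's "status"
// is a string ("yellow"), but the error-response struct expects an int there.
// This ErrorResponse is a stripped-down stand-in, not the operator's real struct.
package main

import (
	"encoding/json"
	"fmt"
)

type ErrorResponse struct {
	Status int `json:"status"` // HTTP status code in a real ES error body
}

func main() {
	healthBody := []byte(`{"cluster_name":"test-agent-upgrade-nvq8","status":"yellow","timed_out":true}`)

	var errResp ErrorResponse
	err := json.Unmarshal(healthBody, &errResp)
	fmt.Println(err)
	// json: cannot unmarshal string into Go struct field ErrorResponse.status of type int
}
```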
The only thing to retain is that the operator gets a few 408s when requesting /_cluster/health during the upgrade (the call waits for all pending cluster events to be processed and times out while nodes are restarting), which is expected:
https://github.com/elastic/cloud-on-k8s/blob/780831adab3c88e57666d7a1659a589a4a2b5d70/test/e2e/test/elasticsearch/steps_mutation.go#L226-L236
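For completeness, the 408s come from the wait semantics of the health call: GetClusterHealthWaitForAllEvents asks Elasticsearch to wait until pending cluster events are processed, and Elasticsearch answers with HTTP 408 and "timed_out": true when that doesn't happen within the timeout, which is frequent while nodes restart. A rough sketch under the assumption that the call uses wait_for_events with a short timeout (the exact parameters and endpoint are not copied from the operator):

```go
// Rough sketch of why these 408s show up: _cluster/health supports wait_for_*
// parameters, and Elasticsearch answers with HTTP 408 and "timed_out": true when
// the condition is not met within the timeout. The query parameters below are an
// assumption, not copied from GetClusterHealthWaitForAllEvents.
package main

import (
	"fmt"
	"net/http"
)

func main() {
	// Ask the cluster to wait until all pending events are processed, with a very
	// short timeout; during a rolling upgrade this frequently cannot complete in time.
	url := "http://localhost:9200/_cluster/health?wait_for_events=languid&timeout=0s"
	resp, err := http.Get(url)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	if resp.StatusCode == http.StatusRequestTimeout { // 408
		fmt.Println("health request timed out waiting for pending events (expected during the upgrade)")
	}
}
```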
OK, so this is not the exact same test (it was TestVersionUpgradeToLatest8x/Elasticsearch_cluster_health_should_not_have_been_red_during_mutation_process), but I think it might be the same problem. I am summarising what I saw in a new issue, which might actually be an Elasticsearch issue: https://github.com/elastic/cloud-on-k8s/issues/5795
Closing as similar to #5795.