beats Libbeat caches version from last Elasticsearch connection

Summary

After upgrading a cloud deployment from 8.0.1 to 8.1.0, this error can be found in the metricbeat logs:

Failed to connect to backoff(elasticsearch(http://containerhost:9244)): Connection marked as failed because the onConnect callback failed: Elasticsearch is too old. Please upgrade the instance. If you would like to connect to older instances set output.elasticsearch.allow_older_versions to true. ES=8.0.1, Beat=8.1.0.

It seems that metricbeat has a stale ES version cached and continuously fails to connect.

The error is generated from this callback, which ends up pinging ES unless the cached version is already valid. I've pinged my nodes with the API console, and while the load balancer only routed requests to 2/3 nodes, these nodes were returning the expected version, so I suppose this property is stale somehow.

Note that this did not happen when upgrading from 8.0.0 to 8.0.1, and it should only appear when upgrading to a new minor version.
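For illustration, here is a minimal sketch of a version check that compares only major and minor, which would explain why a patch upgrade does not trigger the error. The `Version` type, `ParseVersion`, `LessThanMajorMinor` and `TooOld` are hypothetical names written for this example; that the real check ignores the patch level is an assumption inferred from the behaviour described above, not something confirmed here.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Version is a hypothetical stand-in for a parsed semantic version.
type Version struct {
	Major, Minor, Patch int
}

// ParseVersion parses a plain "MAJOR.MINOR.PATCH" string.
// Suffixes such as "-SNAPSHOT" are not handled in this sketch.
func ParseVersion(s string) (Version, error) {
	parts := strings.Split(s, ".")
	if len(parts) != 3 {
		return Version{}, fmt.Errorf("invalid version %q", s)
	}
	nums := make([]int, 3)
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return Version{}, fmt.Errorf("invalid version %q: %w", s, err)
		}
		nums[i] = n
	}
	return Version{Major: nums[0], Minor: nums[1], Patch: nums[2]}, nil
}

// LessThanMajorMinor reports whether v is older than other,
// ignoring the patch component.
func (v Version) LessThanMajorMinor(other Version) bool {
	if v.Major != other.Major {
		return v.Major < other.Major
	}
	return v.Minor < other.Minor
}

// TooOld reports whether a Beat at version beat would reject
// an Elasticsearch instance at version es under this assumed rule.
func TooOld(es, beat string) (bool, error) {
	esV, err := ParseVersion(es)
	if err != nil {
		return false, err
	}
	beatV, err := ParseVersion(beat)
	if err != nil {
		return false, err
	}
	return esV.LessThanMajorMinor(beatV), nil
}

func main() {
	pairs := [][2]string{{"8.0.0", "8.0.1"}, {"8.0.1", "8.1.0"}}
	for _, p := range pairs {
		old, err := TooOld(p[0], p[1])
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("ES=%s Beat=%s tooOld=%v\n", p[0], p[1], old)
	}
}
```

Under this rule, ES 8.0.0 with Beat 8.0.1 passes the check, while ES 8.0.1 with Beat 8.1.0 is rejected.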

This could happen if metricbeat completely upgrades while ES is still reporting 8.0.1, caches the connection, and still relies on it after ES has finished upgrading. I'd expect the connection to be torn down at some point and removed from the pool, or maybe it's kept around until we read from or write to it?

May 16 '22 10:05 klacabane

Is there a work-around to restart nodes or something like that?

Jun 29 '22 17:06 LeeDr

@LeeDr I think the only workaround atm is to restart the metricbeat processes. In cloud that would boil down to two options:

disabling and re-enabling Metrics collection
restarting the cluster

Jun 29 '22 18:06 klacabane

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

Jul 27 '22 12:07 elasticmachine

It is apparently possible for the cluster versions to temporarily be mixed between the new and the old version of Elasticsearch during an upgrade and our version check doesn't account for this. More details on how upgrades work from a recent cloud upgrade issue:

Each instance will be running the same version of filebeat/metricbeat as the version of Elasticsearch/Kibana/etc. During an upgrade, individual instances are upgraded one after another. The cluster is then in a mixed state, so the master will still be on the old version, while some nodes will already be on the new version (and already sending data back to itself)

I think this is effectively a race condition during upgrades introduced by the changes in this commit starting in v8.1.0.

The public GetVersion method of our ES client used during the version check caches the version: https://github.com/elastic/beats/blob/c61915e183143e91f564d1f6c70e4df44ff677b9/libbeat/cmd/instance/beat.go#L896-L898

https://github.com/elastic/beats/blob/c61915e183143e91f564d1f6c70e4df44ff677b9/libbeat/esleg/eslegclient/connection.go#L383-L390
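To make the caching behaviour concrete, here is a minimal sketch of a getter that only fetches the version when nothing is cached yet. `fakeCluster`, `Connection` and `Reconnect` are hypothetical types and methods written for this example, not the actual eslegclient code; clearing the cache on reconnect is shown only as one possible mitigation, not as the agreed fix.

```go
package main

import "fmt"

// fakeCluster is a hypothetical stand-in for the Elasticsearch
// endpoint; its version changes when the upgrade completes.
type fakeCluster struct {
	version string
}

// Connection is a hypothetical client that caches the server version.
type Connection struct {
	cluster *fakeCluster
	version string // cached value; empty means "not fetched yet"
}

// GetVersion returns the cached version and only asks the cluster
// when the cache is empty, so a value fetched mid-upgrade is kept.
func (c *Connection) GetVersion() string {
	if c.version == "" {
		c.version = c.cluster.version
	}
	return c.version
}

// Reconnect clears the cache before fetching, so each new
// connection attempt sees the version the cluster reports now.
func (c *Connection) Reconnect() string {
	c.version = ""
	return c.GetVersion()
}

func main() {
	cluster := &fakeCluster{version: "8.0.1"}
	conn := &Connection{cluster: cluster}

	// The Beat connects while the cluster still reports the old version.
	fmt.Println("during upgrade:", conn.GetVersion())

	// The cluster finishes upgrading, but the cached value is reused.
	cluster.version = "8.1.0"
	fmt.Println("after upgrade (cached):", conn.GetVersion())

	// Dropping the cached value lets the check see the new version.
	fmt.Println("after cache reset:", conn.Reconnect())
}
```

In this sketch every retry that goes through `GetVersion` keeps seeing 8.0.1 until the process restarts or the cache is cleared, which matches the workaround of restarting metricbeat.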

Aug 02 '22 18:08 cmacknz

Hi! We just realized that we haven't looked into this issue in a while. We're sorry!

We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1:. Thank you for your contribution!

Aug 02 '23 19:08 botelastic[bot]
