# [Transform] Elasticsearch upgrades make transforms fail easily
### Elasticsearch Version
main

### Installed Plugins
No response

### Java Version
bundled

### OS Version
serverless, Cloud
### Problem Description
Users are noticing transforms failing when there is an Elasticsearch version upgrade. This came up in serverless and on Cloud; I'm not sure if it also affects stateful ES. Each such upgrade can make a transform fail. Once the transform fails, the user has to manually stop and delete it and create a new transform.
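For context, the manual recovery today looks roughly like the sketch below (Kibana Dev Tools console syntax; `my-transform` is a placeholder id, and `_reset` is only available on versions that support it):

```
# Force-stop the failed transform, then delete it and recreate it from its original configuration.
POST _transform/my-transform/_stop?force=true
DELETE _transform/my-transform

# Alternative on recent versions: reset the stopped transform instead of deleting and recreating it.
POST _transform/my-transform/_reset
```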
The primary purpose of this GH issue is to reduce the volume of transform alerts and user complaints. Some questions/ideas that need to be addressed:
- [ ] We've seen the problem happening for transforms with `unattended` set to `false`. Does the problem also occur when `unattended` is `true`? If so, this is a bug as we expect `unattended` transforms to never fail.
- [ ] Even when `unattended` is `false`, what can we do to make the transform more robust during these upgrades? Maybe we can make all the transforms slightly more "unattended", i.e. less prone to intermittent issues.
- [ ] Maybe the transforms should treat all the error types as recoverable?
- [ ] What is the right retrying strategy for a non-unattended transform?
- [ ] Does the problem happen for a version upgrade only or does it also happen for a full cluster restart (but without changing the version)?
### Steps to Reproduce
It happens during Cloud upgrades.
### Logs (if relevant)
No response
### Tasks
- [ ] https://github.com/elastic/elasticsearch/issues/107215
- [ ] https://github.com/elastic/elasticsearch/issues/100891
- [ ] https://github.com/elastic/elasticsearch/issues/107263
- [ ] https://github.com/elastic/elasticsearch/issues/107266
Pinging @elastic/ml-core (Team:ML)
1719 instances of `Transform has failed` errors on serverless over the last 90 days:
- 14 of `The object cannot be set twice` -> https://github.com/elastic/elasticsearch/issues/107215
- 25 of `Failed to reload transform configuration for transform <>`
- 846 of `[parent] Data too large, data for [indices:data/read/search[phase/query]] would be [4092117844/3.8gb], which is larger than the limit of [4080218931/3.7gb], real usage: [4092117096/3.8gb], new bytes reserved: [748/748b], usages [model_inference=0/0b, eql_sequence=0/0b, fielddata=394234840/375.9mb, request=0/0b, inflight_requests=2244/2.1kb]`
- 9 of `Failed to persist transform statistics for transform`
- 3-5 per rollout that are similar to `Bulk index experienced [2] failures and at least 1 irrecoverable [org.elasticsearch.index.IndexNotFoundException: no such index [.metrics-endpoint.metadata_united_default]].`
- 2 of `rejected execution of TimedRunnable`
Transient issues, such as temporary search or indexing problems, are retried. The transform will fail if this configurable retry count is exceeded. The workaround is to increase the retry count, fix cluster stability, or run the transform as unattended. These can be excluded from the scope of this initial investigation.
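As a reference, a minimal sketch of both knobs via the transform update API (Kibana Dev Tools console syntax; `my-transform` is a placeholder id, the `num_failure_retries` and `unattended` settings assume a version that supports them, roughly 8.5+, and `unattended` may need to be set at creation time if it cannot be flipped on an existing transform):

```
# Raise the per-transform retry budget and/or switch the transform to unattended mode.
POST _transform/my-transform/_update
{
  "settings": {
    "num_failure_retries": 50,
    "unattended": true
  }
}
```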
To help focus investigations, I would suggest we look at the behaviour of transforms whilst nodes leave and join the cluster. I believe it is likely we have some code paths here which can lead to errors and race conditions depending on the order and time it takes for these things to happen. Transforms may not react quickly enough when a node is shutting down. Transforms may be too eager to start when a node is joining.
Nodes may leave/join due to an upgrade, a restart, or catastrophic node failure. It could be any node: the one running the task, the one hosting the config index, etc. Because it is easier to test, I believe it's worth initially validating whether transforms behave well during node movement, rather than during an upgrade. (Also, from experience, upgrade errors tend to manifest themselves as cluster state failures, and we don't see those at the moment.)
Timeouts for graceful node shutdowns are longer for Serverless than for non-Serverless, so I'd prioritise Serverless initially as we've seen more alerts there (however, I think both are applicable, so pick whichever is easiest for bulk-testing multi-node movement).
> To help focus investigations, I would suggest we look at the behaviour of transforms whilst nodes leave and join the cluster. I believe it is likely we have some code paths here which can lead to errors and race conditions depending on the order and time it takes for these things to happen. Transforms may not react quickly enough when a node is shutting down. Transforms may be too eager to start when a node is joining.
This makes sense, and I'll try to keep the task list ordered by priority.
Will look into this as well: https://github.com/elastic/elasticsearch/issues/100891
It's likely we don't have to worry about some of these inconsistencies during a rollout if we can handle the rollout
Related to `data too large`: https://github.com/elastic/elasticsearch/issues/60391
`[endpoint.metadata_united-default-8.14.0] transform has failed; experienced: [Insufficient memory for search after repeated page size reductions to [0], unable to continue pivot, please simplify job or increase heap size on data nodes.].`
Related to `Data too large`:
`[endpoint.metadata_current-default-8.14.0] transform has failed; experienced: [task encountered irrecoverable failure: org.elasticsearch.index.shard.IllegalIndexShardStateException: CurrentState[RECOVERING] operations only allowed when shard state is one of [POST_RECOVERY, STARTED]].`
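One knob that is sometimes suggested for memory-related transform failures like the ones above is lowering the transform's `max_page_search_size`, so each composite aggregation page is smaller; it is only a partial mitigation when the whole node is under memory pressure. A minimal sketch (console syntax; `my-transform` is a placeholder id):

```
# Reduce the composite aggregation page size (default 500) to lower per-search memory usage.
POST _transform/my-transform/_update
{
  "settings": {
    "max_page_search_size": 200
  }
}
```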
A user also found these after upgrade from `8.11.1` to `8.13.2`:
- `Transform task state is [failed]`
- `task encountered more than 10 failures; latest failure: read past EOF: MemorySegmentIndexInput(path="/app/data/indices/mtc6-NYrQPi_irBUjusHPA/0/index/_5ro.cfs") [slice=_5ro_ES87TSDB_0.dvd] [slice=values]`
@nerophon that seems to be an issue with the index. From searching around, it seems that it is corrupted. Do you know if the index is a Transform internal index, the source index that the Transform is searching, or the destination index that the Transform is bulk writing to? I'm not sure if there's anything the Transform can automatically do to recover in this scenario. That seems to require external intervention.
A few new ones:
- `Caused by: java.lang.IllegalArgumentException: field [message] not present as part of path [message]`

Doesn't seem to reoccur.
There are still a lot of WARNs due to node disconnects, missing shards, etc., that happen while nodes join/leave the cluster. We could potentially listen for shutdown events and handle them accordingly, but there don't seem to be any transforms moving into a failed state for these reasons.
We haven't seen unrecoverable failures in the last month, so I think it is safe to mark this as closed; we can prioritize new issues outside of this meta-issue.