# [Transform] Elasticsearch upgrades make transforms fail easily
### Elasticsearch Version
main

### Installed Plugins
No response

### Java Version
bundled

### OS Version
serverless, Cloud
### Problem Description
Users are noticing transforms failing when there is an Elasticsearch version upgrade. This came up in serverless and on Cloud; I'm not sure if it also affects stateful ES. Each such upgrade can make a transform fail. Once the transform fails, the user has to manually stop and delete it and create a new transform.
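For context, the manual recovery today looks roughly like the sketch below (Kibana Dev Tools console syntax; `my-transform` is a placeholder id, and `_reset` is only available on versions that support it):

```
# Force-stop the failed transform, then delete it and recreate it from its original configuration.
POST _transform/my-transform/_stop?force=true
DELETE _transform/my-transform

# Alternative on recent versions: reset the stopped transform instead of deleting and recreating it.
POST _transform/my-transform/_reset
```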
The primary purpose of this GH issue is to reduce the volume of transform alerts and user complaints. Some questions/ideas that need to be addressed:
- [ ] We've seen the problem happening for transforms with `unattended` set to `false`. Does the problem also occur when `unattended` is `true`? If so, this is a bug as we expect `unattended` transforms to never fail.
- [ ] Even when `unattended` is `false`, what can we do to make the transform more robust during these upgrades? Maybe we can make all the transforms slightly more "unattended", i.e. less prone to intermittent issues.
- [ ] Maybe the transforms should treat all the error types as recoverable?
- [ ] What is the right retrying strategy for a non-unattended transform?
- [ ] Does the problem happen for a version upgrade only or does it also happen for a full cluster restart (but without changing the version)?
### Steps to Reproduce
It happens during Cloud upgrades.
### Logs (if relevant)
No response
### Tasks
- [ ] https://github.com/elastic/elasticsearch/issues/107215
- [ ] https://github.com/elastic/elasticsearch/issues/100891
- [ ] https://github.com/elastic/elasticsearch/issues/107263
- [ ] https://github.com/elastic/elasticsearch/issues/107266
Pinging @elastic/ml-core (Team:ML)
1719 instances of `Transform has failed` errors on serverless over the last 90 days:
- 14 of `The object cannot be set twice` -> https://github.com/elastic/elasticsearch/issues/107215
- 25 of `Failed to reload transform configuration for transform <>`
- 846 of `[parent] Data too large, data for [indices:data/read/search[phase/query]] would be [4092117844/3.8gb], which is larger than the limit of [4080218931/3.7gb], real usage: [4092117096/3.8gb], new bytes reserved: [748/748b], usages [model_inference=0/0b, eql_sequence=0/0b, fielddata=394234840/375.9mb, request=0/0b, inflight_requests=2244/2.1kb]`
- 9 of `Failed to persist transform statistics for transform`
- 3-5 per rollout that are similar to `Bulk index experienced [2] failures and at least 1 irrecoverable [org.elasticsearch.index.IndexNotFoundException: no such index [.metrics-endpoint.metadata_united_default]].`
- 2 of `rejected execution of TimedRunnable`
Transient issues, such as temporary search or indexing problems, are retried. The transform will fail if this configurable retry count is exceeded. The workaround is to increase the retry count, fix cluster stability, or run the transform as unattended. These can be excluded from the scope of this initial investigation.
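As a reference, a minimal sketch of both knobs via the transform update API (Kibana Dev Tools console syntax; `my-transform` is a placeholder id, the `num_failure_retries` and `unattended` settings assume a version that supports them, roughly 8.5+, and `unattended` may need to be set at creation time if it cannot be flipped on an existing transform):

```
# Raise the per-transform retry budget and/or switch the transform to unattended mode.
POST _transform/my-transform/_update
{
  "settings": {
    "num_failure_retries": 50,
    "unattended": true
  }
}
```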
To help focus investigations, I would suggest we look at the behaviour of transforms whilst nodes leave and join the cluster. I believe it is likely we have some code paths here which can lead to errors and race conditions depending on the order and time it takes for these things to happen. Transforms may not react quickly enough when a node is shutting down. Transforms may be too eager to start when a node is joining.
Nodes may leave/join due to an upgrade, a restart, or catastrophic node failure. It could be any node: the one running the task, the one hosting the config index, etc. Because it is easier to test, I believe it's worth initially validating whether transforms behave well during node movement, rather than during an upgrade. (Also, from experience, upgrade errors tend to manifest themselves as cluster state failures, and we don't see those at the moment.)
Timeouts for graceful node shutdowns are longer for Serverless than for non-Serverless, so I'd prioritise Serverless initially as we've seen more alerts there (however, I think both are applicable, so pick whichever is easiest for bulk-testing multi-node movement).
> To help focus investigations, I would suggest we look at the behaviour of transforms whilst nodes leave and join the cluster. I believe it is likely we have some code paths here which can lead to errors and race conditions depending on the order and time it takes for these things to happen. Transforms may not react quickly enough when a node is shutting down. Transforms may be too eager to start when a node is joining.
This makes sense, and I'll try to keep the task list ordered by priority.
Will look into this as well: https://github.com/elastic/elasticsearch/issues/100891
It's likely we don't have to worry about some of these inconsistencies during a rollout if we can handle the rollout
Related to `data too large`: https://github.com/elastic/elasticsearch/issues/60391
`[endpoint.metadata_united-default-8.14.0] transform has failed; experienced: [Insufficient memory for search after repeated page size reductions to [0], unable to continue pivot, please simplify job or increase heap size on data nodes.].`
Related to `Data too large`:
`[endpoint.metadata_current-default-8.14.0] transform has failed; experienced: [task encountered irrecoverable failure: org.elasticsearch.index.shard.IllegalIndexShardStateException: CurrentState[RECOVERING] operations only allowed when shard state is one of [POST_RECOVERY, STARTED]].`
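One knob that is sometimes suggested for memory-related transform failures like the ones above is lowering the transform's `max_page_search_size`, so each composite aggregation page is smaller; it is only a partial mitigation when the whole node is under memory pressure. A minimal sketch (console syntax; `my-transform` is a placeholder id):

```
# Reduce the composite aggregation page size (default 500) to lower per-search memory usage.
POST _transform/my-transform/_update
{
  "settings": {
    "max_page_search_size": 200
  }
}
```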
A user also found these after upgrade from `8.11.1` to `8.13.2`:
- `Transform task state is [failed]`
- `task encountered more than 10 failures; latest failure: read past EOF: MemorySegmentIndexInput(path="/app/data/indices/mtc6-NYrQPi_irBUjusHPA/0/index/_5ro.cfs") [slice=_5ro_ES87TSDB_0.dvd] [slice=values]`
@nerophon that seems to be an issue with the index. From searching around, it seems that it is corrupted. Do you know if the index is a Transform internal index, the source index that the Transform is searching, or the destination index that the Transform is bulk writing to? I'm not sure if there's anything the Transform can automatically do to recover in this scenario. That seems to require external intervention.
A few new ones:
- `Caused by: java.lang.IllegalArgumentException: field [message] not present as part of path [message]`

Doesn't seem to reoccur.
There are still a lot of WARNs due to node disconnects, missing shards, etc., that happen while nodes join/leave the cluster. We could potentially listen for shutdown events and handle them accordingly, but there don't seem to be any transforms moving into a failed state for these reasons.
We haven't seen unrecoverable failures in the last month, so I think it is safe to mark this as closed; we can prioritize new issues outside of this meta-issue.