[BUG] illegal state: trying to move shard from primary mode to replica mode (Index-type: remote_snapshot)
Describe the bug
During a cluster restart, OpenSearch appears to attempt to relocate the primary shard of a remote_snapshot-type index and fails.
This might be an instance of the problem mentioned in https://github.com/opensearch-project/OpenSearch/pull/11563#issuecomment-1857798490.
[2024-02-15T15:43:08,808][INFO ][o.o.c.r.a.a.BalancedShardsAllocator] [10.0.1.146] Swap relocation performed for shard [[index_5][0], node[ECzLzBEhTYmA58qyuEWNaQ], [R], s[STARTED], a[id=WGd5CZOZTf2-qD411BjkoQ]]
[2024-02-15T15:43:09,012][WARN ][o.o.i.c.IndicesClusterStateService] [10.0.1.146] [index_5][0] marking and sending shard failed due to [failed updating shard routing entry]
java.lang.IllegalArgumentException: illegal state: trying to move shard from primary mode to replica mode. Current [index_5][0], node[ECzLzBEhTYmA58qyuEWNaQ], [P], s[STARTED], a[id=WGd5CZOZTf2-qD411BjkoQ], new [index_5][0], node[ECzLzBEhTYmA58qyuEWNaQ], [R], s[STARTED], a[id=WGd5CZOZTf2-qD411BjkoQ]
at org.opensearch.index.shard.IndexShard.updateShardState(IndexShard.java:597) ~[opensearch-2.11.1.jar:2.11.1]
at org.opensearch.indices.cluster.IndicesClusterStateService.updateShard(IndicesClusterStateService.java:710) [opensearch-2.11.1.jar:2.11.1]
at org.opensearch.indices.cluster.IndicesClusterStateService.createOrUpdateShards(IndicesClusterStateService.java:650) [opensearch-2.11.1.jar:2.11.1]
at org.opensearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:293) [opensearch-2.11.1.jar:2.11.1]
at org.opensearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:606) [opensearch-2.11.1.jar:2.11.1]
at org.opensearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:593) [opensearch-2.11.1.jar:2.11.1]
at org.opensearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:561) [opensearch-2.11.1.jar:2.11.1]
at org.opensearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:484) [opensearch-2.11.1.jar:2.11.1]
at org.opensearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:186) [opensearch-2.11.1.jar:2.11.1]
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:849) [opensearch-2.11.1.jar:2.11.1]
at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:282) [opensearch-2.11.1.jar:2.11.1]
at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:245) [opensearch-2.11.1.jar:2.11.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:833) [?:?]
This appears to leave a replica shard perpetually stuck in the INITIALIZING state:
ubuntu@ip-10-0-252-254:~$ curl -s -XGET "http://******:*******@10.0.1.146:9200/_cat/recovery?active_only=true&v=true"
index shard time type stage source_host source_node target_host target_node repository snapshot files files_recovered files_percent files_total bytes bytes_recovered bytes_percent bytes_total translog_ops translog_ops_recovered translog_ops_percent
index_5 0 57.4m peer init 10.0.1.204 10.0.1.204 10.0.1.146 10.0.1.146 n/a n/a 0 0 0.0% 0 0 0 0.0% 0 -1 0 -1.0%
ubuntu@ip-10-0-252-254:~$
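For reference, the shard-level state could also be checked via _cat/shards; a sketch using the same host and masked credentials as above, with the index name taken from the recovery output:
curl -s -XGET "http://******:*******@10.0.1.146:9200/_cat/shards/index_5?v"
This would be expected to list the [index_5][0] replica as INITIALIZING on 10.0.1.146.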
There is no obvious cause visible from the allocation APIs:
ubuntu@ip-10-0-252-254:~$ curl -s -XGET "http://******:*******@10.0.1.146:9200/_cat/allocation?v"
shards disk.indices disk.used disk.avail disk.total disk.percent host ip node
16 29.4gb 16gb 32.2gb 48.2gb 33 10.0.1.6 10.0.1.6 10.0.1.6
15 22.5gb 9gb 39.2gb 48.2gb 18 10.0.1.204 10.0.1.204 10.0.1.204
17 36.7gb 16.1gb 32.1gb 48.2gb 33 10.0.1.146 10.0.1.146 10.0.1.146
ubuntu@ip-10-0-252-254:~$
ubuntu@ip-10-0-252-254:~$ curl -s -XGET "http://******:*******@10.0.1.146:9200/_cluster/allocation/explain?pretty"
{
"error" : {
"root_cause" : [
{
"type" : "illegal_argument_exception",
"reason" : "unable to find any unassigned shards to explain [ClusterAllocationExplainRequest[useAnyUnassignedShard=true,includeYesDecisions?=false]"
}
],
"type" : "illegal_argument_exception",
"reason" : "unable to find any unassigned shards to explain [ClusterAllocationExplainRequest[useAnyUnassignedShard=true,includeYesDecisions?=false]"
},
"status" : 400
}
ubuntu@ip-10-0-252-254:~$
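Note that the bodyless allocation-explain call only selects an arbitrary unassigned shard, which is presumably why it errors out here while the replica is INITIALIZING rather than UNASSIGNED. A sketch of targeting the stuck replica explicitly (index and shard number taken from the recovery output above):
curl -s -XPOST "http://******:*******@10.0.1.146:9200/_cluster/allocation/explain?pretty" -H 'Content-Type: application/json' -d '{"index": "index_5", "shard": 0, "primary": false}'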
Related component
Storage:Snapshots
To Reproduce
This isn't exactly trivial to reproduce; there seems to be an additional factor involved that triggers the problem. However, these are the steps taken to arrive at the current state:
- Create a multi-node OpenSearch cluster.
- Index some data into an index.
- Set up searchable snapshots and create one for the index from the previous step (a sketch of the API calls is shown after this list): https://opensearch.org/docs/latest/tuning-your-cluster/availability-and-recovery/snapshots/searchable_snapshot/#create-a-searchable-snapshot-index
- Restart the OpenSearch cluster.
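For reference, a minimal sketch of the snapshot/restore calls from the linked documentation, assuming an S3 repository and hypothetical names (my-s3-repo, my-bucket, snap-1); a node with the search role is also required for remote_snapshot indices:
curl -XPUT "http://localhost:9200/_snapshot/my-s3-repo" -H 'Content-Type: application/json' -d '{"type": "s3", "settings": {"bucket": "my-bucket", "base_path": "snapshots"}}'
curl -XPUT "http://localhost:9200/_snapshot/my-s3-repo/snap-1?wait_for_completion=true"
curl -XPOST "http://localhost:9200/_snapshot/my-s3-repo/snap-1/_restore" -H 'Content-Type: application/json' -d '{"indices": "index_5", "storage_type": "remote_snapshot", "rename_pattern": "(.+)", "rename_replacement": "restored_$1"}'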
Expected behavior
The expected behavior is for all shards to be recovered successfully after the restart, without operations that leave the cluster in a yellow state (e.g. orphaned replica shards).
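A way to verify this after a restart (a sketch; same host and masked credentials as in the commands above):
curl -s -XGET "http://******:*******@10.0.1.146:9200/_cluster/health?pretty"
The status should come back green with initializing_shards at 0 once recovery has completed.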
Additional Details
Plugins:
opensearch-alerting
opensearch-anomaly-detection
opensearch-asynchronous-search
opensearch-cross-cluster-replication
opensearch-custom-codecs
opensearch-geospatial
opensearch-index-management
opensearch-job-scheduler
opensearch-knn
opensearch-ml
opensearch-neural-search
opensearch-notifications
opensearch-notifications-core
opensearch-observability
opensearch-performance-analyzer
opensearch-reports-scheduler
opensearch-security
opensearch-security-analytics
opensearch-sql
prometheus-exporter
repository-s3
Host/Environment:
- OS: Debian
- Version: Bullseye
- OpenSearch: 2.11.1
Additional context
This might be an instance of the problem mentioned in https://github.com/opensearch-project/OpenSearch/pull/11563#issuecomment-1857798490.
Thanks @etgraylog! This does indeed look like the issue fixed by #11563. That fix is included in 2.12, which will be released in the coming week. Will you be able to pick up that release and test this?
Thanks @andrross! Certainly, I'll stay tuned 👍
With 2.12.0 I'm not able to reproduce the issue so far; it seems the fix is working, @andrross. Thanks again!