[Docs] Step-by-step tutorial for uni-directional CCR failover
Description
As of writing, CCR does not offer automatic failover. Can we please add the following tutorial for the failover scenario?
The initial setup can be skipped as it's similar to Tutorial: Set up cross-cluster replication. Adding it here for completeness.
Initial setup (uni-directional CCR with DR cluster following Production cluster)
Step 1: On DR, create a remote cluster connection pointing to Production
### On DR cluster ###
PUT /_cluster/settings
{
"persistent" : {
"cluster" : {
"remote" : {
"production" : {
"seeds" : [
"127.0.0.1:9300"
]
}
}
}
}
}
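Optionally, before creating any followers, you can confirm the remote connection is working with the remote info API (this check is not part of the original tutorial):

```
### On DR cluster ###
GET /_remote/info
### "production" should be listed with "connected": true
```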
Step 2: Create an index on Production
### On Production cluster ###
PUT /my_index
POST /my_index/_doc/1
{
"foo":"bar"
}
Step 3: Create a follower index on DR
### On DR cluster ###
PUT /my_index/_ccr/follow
{
"remote_cluster" : "production",
"leader_index" : "my_index"
}
Step 4: Test the follower index on DR
### On DR cluster ###
GET /my_index/_search
### This should return the document created on Production (foo/bar)
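A search proves the data arrived, but you can also inspect the replication state directly via the CCR follower APIs (an optional check; exact field values may vary by version):

```
### On DR cluster ###
GET /my_index/_ccr/info
### my_index should be listed as following "production" with status "active"
GET /my_index/_ccr/stats
### Per-shard replication stats for the follower
```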
⚠️ Ingestion should only be written to the Production cluster; search queries can be directed to either the Production or DR cluster.
When Production goes down:
Step 1: On the client side, pause ingestion of my_index into Production.
Step 2: On the Elasticsearch side, turn the follower indices on DR into regular indices:
- Ensure no writes are occurring on the leader index (if the data centre is down or the cluster is unavailable, no action is needed).
- On DR, convert the follower index into a regular Elasticsearch index capable of accepting writes.
### On DR cluster ###
POST /my_index/_ccr/pause_follow
POST /my_index/_close
POST /my_index/_ccr/unfollow
POST /my_index/_open
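The four calls must be issued in this order: the follower has to be paused and closed before unfollow is accepted, then reopened so it can serve writes. As an optional sanity check, the follower info API should no longer report my_index as a follower:

```
### On DR cluster ###
GET /my_index/_ccr/info
### "follower_indices" should now be empty for my_index
```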
Step 3: On the client side, manually re-enable ingestion of my_index to the DR cluster. You can verify that the index is now writable:
### On DR cluster ###
POST my_index/_doc/2
{
"foo": "new"
}
⚠️ Make sure all traffic is redirected to the DR cluster during this time.
Once Production comes back:
Step 1: On the client side, stop writes to my_index on the DR cluster.
Step 2: On Production, create a remote cluster connection pointing to DR
### On Production cluster ###
PUT _cluster/settings
{
"persistent" : {
"cluster" : {
"remote" : {
"dr" : {
"seeds" : [
"127.0.0.2:9300"
]
}
}
}
}
}
Step 3: Create follower indices on Production, following the leaders on DR. The former leader indices on Production contain outdated data and must be deleted first. Wait for the Production follower indices to catch up; once they are caught up, turn them back into regular indices.
### On Production cluster ###
DELETE my_index
### Create follower index on Production to follow from DR cluster
PUT /my_index/_ccr/follow
{
"remote_cluster" : "dr",
"leader_index" : "my_index"
}
### Wait for my_index to catch up with DR and contain all the documents.
GET my_index/_search
### Stop following from DR to turn my_index into a regular index.
POST /my_index/_ccr/pause_follow
POST /my_index/_close
POST /my_index/_ccr/unfollow
POST /my_index/_open
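Rather than eyeballing search results, you can judge catch-up from the follower stats by comparing checkpoints (a sketch; run this before the pause_follow step):

```
### On Production cluster ###
GET /my_index/_ccr/stats
### The follower has caught up when, for every shard,
### "follower_global_checkpoint" equals "leader_global_checkpoint"
```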
Step 4: Delete the former writeable indices on DR, which now contain outdated data, then create follower indices on DR again so that all changes from Production are streamed to DR. (This is the same as the initial setup.)
### On DR cluster ###
DELETE my_index
### Create follower index on `DR` to follow from the `Production` cluster
PUT /my_index/_ccr/follow
{
"remote_cluster" : "production",
"leader_index" : "my_index"
}
Step 5: On the client side, manually re-enable ingestion to the Production cluster.
⚠️ Ingestion should only be written to the Production cluster; search queries can be directed to either the Production or DR cluster.
Pinging @elastic/es-docs (Team:Docs)
How should .kibana be handled?
We removed the auto-follow pattern for system indices in 8.0.0, but we can still specify a follower index to replicate a specific leader_index. For example, to replicate .kibana_8.0.0_001 from clusterA to clusterB, execute this on clusterB:
DELETE .kibana_8.0.0_001
PUT .kibana_8.0.0_001/_ccr/follow?wait_for_active_shards=1
{
"remote_cluster" : "clusterA",
"leader_index" : ".kibana_8.0.0_001"
}
The DR flow for system indices is the same as my initial post, just that it should avoid using the _ccr/auto_follow API.
Thanks Leaf. Can we add this to the documentation as well, as it pertains to the DR use case and covers failover/failback scenarios?
Customers in this situation would certainly want to understand what can be synced to their DR cluster vs. what cannot.
As a follow up question, what ARE the effects of setting up the .kibana as a follower index? How does this pertain to all kinds of configuration elements as well as not allowing direct access to system indices? For example, does this also work with the taskmanager part to handle the dimension of duplicate tasks/alerts? If this would also be included so that customers understand the impact dimensions that would go a long way for the right DR setup in combination with CCR.
@Leaf-Lin: Another follow-up question to add to the list:
Will users be able to continue to explicitly follow system indices and complete the steps to convert them to regular indices?
Over the last few months there have been a number of changes that make it more difficult to work with the system indices, for example #72815, #63513, and #74212. When our users are planning their DR strategies, they want to know how forward-compatible their plans are, as they need to be continually patching their deployments. So knowing whether explicit CCR is planned to be taken away is important.
Pinging @elastic/es-distributed (Team:Distributed)
- On following the .kibana index:
Can we add this to the documentation as well, as it pertains to the DR use case, and covering failover/failback scenarios?
Although the comment will allow us to replicate this particular system index, it is not ideal. One still needs to manually delete the .kibana indices on the follower cluster before they can replicate the system indices, and it does not cover situations when the cluster gets upgraded (which changes .kibana to new index names). Furthermore, this workaround may work with .kibana, but it is not guaranteed to work with all system indices. For example, there seems to be little value in replicating the .async-search or .tasks index outside the cluster running these tasks. Unfortunately, there is currently no systematic way to select the right set of system indices to replicate. You can see a similar comment in ^1.
what ARE the effects of setting up the .kibana as a follower index? How does this pertain to all kinds of configuration elements as well as not allowing direct access to system indices? For example, does this also work with the taskmanager part to handle the dimension of duplicate tasks/alerts?
These questions are spot-on. For the reasons you have mentioned (plus upgrade handling), the follower cluster must have .kibana set to read-only mode, so the follower will not be able to set up alerts, visualizations, or any other task management features that currently require writes to the .kibana or .kibana_task_manager index.
- On the system indices + CCR planning:
Will users be able to continue to explicitly follow system indices and complete the steps to convert them to regular indices?
We agree that disaster recovery for system indices today is not well-implemented. I have raised an enhancement request^2 which would require a cross-team effort to address.
I am not aware of any planned changes in the short term.
@Leaf-Lin the doc team is very busy, do you think you can provide the documentation change you proposed?
Sorry for the noise, I just saw you provided #87099