[Docs] Step-by-step tutorial for uni-directional CCR failover
Description
As of writing, CCR does not offer automatic failover. Can we please add the following tutorial for the failover scenario?
The initial setup can be skipped as it's similar to Tutorial: Set up cross-cluster replication. Adding it here for completeness.
Initial setup (uni-directional CCR with DR cluster following Production cluster)
Step 1: On DR, create a remote cluster connection pointing to Production
### On DR cluster ###
PUT /_cluster/settings
{
"persistent" : {
"cluster" : {
"remote" : {
"production" : {
"seeds" : [
"127.0.0.1:9300"
]
}
}
}
}
}
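Optionally, before creating any followers, you can confirm the remote connection is working with the remote info API (this check is not part of the original tutorial):

```
### On DR cluster ###
GET /_remote/info
### "production" should be listed with "connected": true
```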
Step 2: Create an index on Production
### On Production cluster ###
PUT /my_index
POST /my_index/_doc/1
{
"foo":"bar"
}
Step 3: Create a follower index on DR
### On DR cluster ###
PUT /my_index/_ccr/follow
{
"remote_cluster" : "production",
"leader_index" : "my_index"
}
Step 4: Test the follower index on DR
### On DR cluster ###
GET /my_index/_search
### This should return the document created on Production (foo/bar)
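A search proves the data arrived, but you can also inspect the replication state directly via the CCR follower APIs (an optional check; exact field values may vary by version):

```
### On DR cluster ###
GET /my_index/_ccr/info
### my_index should be listed as following "production" with status "active"
GET /my_index/_ccr/stats
### Per-shard replication stats for the follower
```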
⚠️ Ingestion should only be written to the Production cluster; search queries can be directed to either the Production or DR cluster.
When Production goes down:
Step 1: On the client side, pause ingestion of my_index into Production.
Step 2: On the Elasticsearch side, turn the follower indices on DR into regular indices:
- Ensure no writes are occurring on the leader index (if the data centre is down or the cluster is unavailable, no action is needed).
- On DR, convert the follower index into a regular Elasticsearch index capable of accepting writes.
### On DR cluster ###
POST /my_index/_ccr/pause_follow
POST /my_index/_close
POST /my_index/_ccr/unfollow
POST /my_index/_open
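The four calls must be issued in this order: the follower has to be paused and closed before unfollow is accepted, then reopened so it can serve writes. As an optional sanity check, the follower info API should no longer report my_index as a follower:

```
### On DR cluster ###
GET /my_index/_ccr/info
### "follower_indices" should now be empty for my_index
```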
Step 3: On the client side, manually re-enable ingestion of my_index to the DR cluster. You can verify that the index is now writable:
### On DR cluster ###
POST my_index/_doc/2
{
"foo": "new"
}
⚠️ Make sure all traffic is redirected to the DR cluster during this time.
Once Production comes back:
Step 1: On the client side, stop writes to my_index on the DR cluster.
Step 2: On Production, create a remote cluster connection pointing to DR
### On Production cluster ###
PUT _cluster/settings
{
"persistent" : {
"cluster" : {
"remote" : {
"dr" : {
"seeds" : [
"127.0.0.2:9300"
]
}
}
}
}
}
Step 3: Create follower indices on Production, following the leaders on DR. The former leader indices on Production contain outdated data and must be deleted first. Wait for the Production follower indices to catch up; once they are caught up, turn them back into regular indices.
### On Production cluster ###
DELETE my_index
### Create follower index on Production to follow from DR cluster
PUT /my_index/_ccr/follow
{
"remote_cluster" : "dr",
"leader_index" : "my_index"
}
### Wait for my_index to catch up with DR and contain all the documents.
GET my_index/_search
### Stop following from DR to turn my_index into a regular index.
POST /my_index/_ccr/pause_follow
POST /my_index/_close
POST /my_index/_ccr/unfollow
POST /my_index/_open
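Rather than eyeballing search results, you can judge catch-up from the follower stats by comparing checkpoints (a sketch; run this before the pause_follow step):

```
### On Production cluster ###
GET /my_index/_ccr/stats
### The follower has caught up when, for every shard,
### "follower_global_checkpoint" equals "leader_global_checkpoint"
```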
Step 4: Delete the former writeable indices on DR, which now contain outdated data, then create follower indices on DR again so that all changes from Production are streamed to DR. (This is the same as the initial setup.)
### On DR cluster ###
DELETE my_index
### Create follower index on `DR` to follow from the `Production` cluster
PUT /my_index/_ccr/follow
{
"remote_cluster" : "production",
"leader_index" : "my_index"
}
Step 5: On the client side, manually re-enable ingestion to the Production cluster.
⚠️ Ingestion should only be written to the Production cluster; search queries can be directed to either the Production or DR cluster.
Pinging @elastic/es-docs (Team:Docs)
How should .kibana be handled?
We removed the auto-follow pattern for system indices in 8.0.0, but we can still specify a follower index to replicate a specific leader_index. For example, to replicate .kibana_8.0.0_001 from clusterA to clusterB, execute this on clusterB:
DELETE .kibana_8.0.0_001
PUT .kibana_8.0.0_001/_ccr/follow?wait_for_active_shards=1
{
"remote_cluster" : "clusterA",
"leader_index" : ".kibana_8.0.0_001"
}
The DR flow for system indices is the same as my initial post, just that it should avoid using the _ccr/auto_follow API.
Thanks Leaf. Can we add this to the documentation as well, as it pertains to the DR use case and covers failover/failback scenarios?
Customers in this situation would certainly want to understand what can be synced to their DR cluster vs. what cannot.
As a follow up question, what ARE the effects of setting up the .kibana as a follower index? How does this pertain to all kinds of configuration elements as well as not allowing direct access to system indices? For example, does this also work with the taskmanager part to handle the dimension of duplicate tasks/alerts? If this would also be included so that customers understand the impact dimensions that would go a long way for the right DR setup in combination with CCR.
@Leaf-Lin: Another follow-up question to add to the list:
Will users be able to continue to explicitly follow system indices and complete the steps to convert them to regular indices?
Over the last few months there have been a number of changes that make it more difficult to work with the system indices, for example #72815, #63513, and #74212. When our users are planning their DR strategies, they want to know how forward-compatible their plans are, as they need to be continually patching their deployments. So knowing whether explicit CCR is planned to be taken away is important.
Pinging @elastic/es-distributed (Team:Distributed)
- On following the .kibana index:
Can we add this to the documentation as well, as it pertains to the DR use case, and covering failover/failback scenarios?
Although the comment will allow us to replicate this particular system index, it is not ideal. One still needs to manually delete the .kibana indices on the follower cluster before they can replicate the system indices, and it does not cover situations when the cluster gets upgraded (which changes .kibana to new index names). Furthermore, this workaround may work with .kibana, but it is not guaranteed to work with all system indices. For example, there seems to be little value in replicating the .async-search or .tasks index outside the cluster running these tasks. Unfortunately, there is currently no systematic way to select the right set of system indices to replicate. You can see a similar comment in ^1.
what ARE the effects of setting up the .kibana as a follower index? How does this pertain to all kinds of configuration elements as well as not allowing direct access to system indices? For example, does this also work with the taskmanager part to handle the dimension of duplicate tasks/alerts?
These questions are spot-on. For the reasons you have mentioned (plus upgrade handling), the follower cluster must have .kibana set to read-only mode, so the follower will not be able to set up alerts, visualizations, or any other task management features that currently require writes to the .kibana or .kibana_task_manager index.
- On the system indices + CCR planning:
Will users be able to continue to explicitly follow system indices and complete the steps to convert them to regular indices?
We agree that disaster recovery for system indices today is not well-implemented. I have raised an enhancement request^2 which would require a cross-team effort to address.
I am not aware of any planned changes in the short term.
@Leaf-Lin the doc team is very busy, do you think you can provide the documentation change you proposed?
Sorry for the noise, I just saw you provided #87099