OpenSearch [RFC] API for decommissioning/recommissioning zone and weighted zonal search request routing policy

Is your feature request related to a problem? Please describe. #3402 aims to build support for decommissioning and recommissioning a zone based on the value of the zone awareness attribute. Similarly, #2859 aims to build support for weighted zonal search request routing using a weighted round-robin mechanism. We need a consistent and precise API structure finalised for both.

The scope of this issue is limited to finalising the API structure for zonal decommission/recommission and weighted zonal search requests.

Describe the solution you'd like Below are the API structures that we can use for zonal decommission/recommission and weighted zonal search requests.

Zone Decommission

PUT /_cluster/decommission/awareness/<zone>/<zone-a>

{
  "acknowledged": true
}

Zone Recommission

DELETE /_cluster/decommission/awareness/<zone>/<zone-a>

{
  "acknowledged": true
}

Get Zone Decommission Status

GET /_cluster/decommission/awareness/_status

{
  "status": "PROCESSING | DECOMMISSIONING | DECOMMISSIONED | DECOMMISSION_FAILED | RECOMMISSIONING | RECOMMISSION_FAILED",
  "awareness": {
    "zone": "zone-A"
  }
}

Get Zone Decommission Status For Local Node

GET _cluster/decommission/awareness/_status?local

{
    "status": "PROCESSING | DECOMMISSIONING | DECOMMISSIONED | DECOMMISSION_FAILED | RECOMMISSIONING | RECOMMISSION_FAILED"
}

Weighted Round Robin for search request


PUT _cluster/shard_routing/weights
{
  "awareness": {
    "zone": {
      "zone_1": "1",
      "zone_2": "1",
      "zone_3": "0"
    }
  }
}

{
  "acknowledged": true,
  "awareness": {
    "zone": {
      "zone_1": "1",
      "zone_2": "1",
      "zone_3": "0"
    }
  }
}
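
For illustration, here is a minimal sketch of how a weighted round robin could spread search requests across zones using the weights above. It is not the OpenSearch implementation (that is tracked in #2859); the function name `weighted_round_robin` and the simple expand-and-cycle strategy are assumptions made for this example.

```python
from itertools import cycle, islice

def weighted_round_robin(weights):
    """Return an endless iterator of zone names, each repeated per its integer weight.

    Zones with weight 0 never appear, i.e. they are weighed away.
    (Hypothetical helper, not an OpenSearch API.)
    """
    # Weights are strings in the request body above, so convert them to int.
    schedule = [zone for zone, w in weights.items() for _ in range(int(w))]
    if not schedule:
        raise ValueError("at least one zone needs a non-zero weight")
    return cycle(schedule)

weights = {"zone_1": "1", "zone_2": "1", "zone_3": "0"}
picks = list(islice(weighted_round_robin(weights), 4))
print(picks)
```

With these weights, requests alternate between zone_1 and zone_2, and zone_3 receives none.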

Get weight for a local node

GET _cluster/shard_routing/weights?local

{
  "weight": 0
}

Get Weight

GET _cluster/shard_routing/weights

{
  "awareness": {
    "zone": {
      "zone_1": "1",
      "zone_2": "1",
      "zone_3": "0"
    }
  }
}

The PUT /_cluster/decommission/awareness/<zone>/<zone-a> call would modify the weights to weigh traffic away from the given zone, and would also check whether there is still incoming HTTP or search traffic to the weighed-away zone. If there is traffic, it moves the status to DRAINING; once incoming HTTP and search traffic has drained, the decommission is executed.
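
The flow described above can be sketched roughly as follows. This is only an illustration under assumptions: the function `decommission_step`, the `has_inflight_traffic` flag, and the choice of DECOMMISSIONING as the status once traffic has drained are all hypothetical and not taken from the implementation.

```python
def decommission_step(weights, zone, has_inflight_traffic):
    """Evaluate one pass of the decommission flow for `zone` (hypothetical helper).

    Returns the updated weights and the resulting status.
    """
    # Weigh the zone away so that no new search traffic is routed to it.
    updated = {**weights, zone: "0"}
    if has_inflight_traffic:
        # Traffic is still arriving or in flight: wait for it to drain.
        return updated, "DRAINING"
    # Nothing left to drain: the decommission itself can be executed.
    return updated, "DECOMMISSIONING"

weights = {"zone_1": "1", "zone_2": "1", "zone_3": "1"}
drained_weights, status = decommission_step(weights, "zone_3", has_inflight_traffic=True)
print(status, drained_weights)
```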


Jun 21 '22 10:06 imRishN

@imRishN Could we please add req(/res) for obtaining current status of Recommission/Decommission as well?

Jun 22 '22 04:06 saikaranam-amazon

@saikaranam-amazon updated the req/res for both update/get calls for recommission/decommission

Jun 26 '22 18:06 imRishN

Thanks @imRishN

Can we store additional information to reflect the updated-at timestamp for the changes? (and this can be part of the response payload for all the GET calls)

Jun 28 '22 06:06 saikaranam-amazon

@saikaranam-amazon could you help elaborate on why that might be needed?

Jun 28 '22 06:06 Bukhtawar

As we have two APIs, one to update the weights of the search traffic and one to decommission an entire zone (without any additional checks or wait time), the latter operation might cause instability for some in-flight operations. With an updated-at field, the customer can make a conscious decision (to decommission the zone) based on the timestamp of the current weights for search traffic, and it can further help with automation as well.

Also, for all the update APIs, do they support conditional updates based on the current weights passed in the payload and the values in the current cluster state (for de/recommission)?
And in the absence of a recommission/decommission state, will the APIs return an empty response? Or rather, can we have the decommission state (bool) as part of the payload itself for the GET calls?
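
To make the conditional-update idea concrete, here is a rough sketch of a compare-and-set on an updated-at timestamp. Everything in it is hypothetical: `WeightsStore` is an in-memory stand-in for the weights held in cluster state, and neither the class nor its method corresponds to an existing OpenSearch API.

```python
class WeightsStore:
    """In-memory stand-in for the weights kept in cluster state (hypothetical)."""

    def __init__(self, weights, updated_at):
        self.weights = weights
        self.updated_at = updated_at

    def conditional_update(self, new_weights, expected_updated_at, now):
        """Apply new_weights only if the weights are unchanged since expected_updated_at."""
        if self.updated_at != expected_updated_at:
            # Someone else changed the weights in the meantime: reject the update.
            return False
        self.weights = new_weights
        self.updated_at = now
        return True

store = WeightsStore({"zone_1": "1", "zone_2": "1"}, updated_at=100)
first = store.conditional_update({"zone_1": "1", "zone_2": "0"}, expected_updated_at=100, now=105)
stale = store.conditional_update({"zone_1": "0", "zone_2": "1"}, expected_updated_at=100, now=110)
print(first, stale, store.weights)
```

The second call is rejected because it was based on a timestamp that is no longer current.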

Jun 28 '22 06:06 saikaranam-amazon

@saikaranam-amazon could you help elaborate on why that might be needed?

Sure @Bukhtawar, updated above.

Jun 28 '22 06:06 saikaranam-amazon

Thanks @imRishN .

Should we make the Zone Decommission API synchronous? The Get commission status API will essentially output the state preserved in cluster state. By making the API synchronous, the need for the status API can be avoided.

If users want to know the status of the decommission, _cat/nodes, _cat/master and _cluster/health should be able to provide the granular details.

Jul 15 '22 12:07 gbbafna

When we decommission the zone, it is possible that traffic hasn't been DRAINED, in which case the call might take longer and time out. The GET API can perform more exhaustive checks on traffic drain, and there could be more graceful checks in future around ongoing snapshots, shard relocation, etc. which would take time to complete. Having a GET API would help extend to other cases as well.
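
As a sketch of how a client might use such a GET API, the loop below polls a status source until a terminal status is reached. The `get_status` callable is hypothetical (in practice it would wrap a call to GET /_cluster/decommission/awareness/_status), and treating DECOMMISSIONED and DECOMMISSION_FAILED as the terminal statuses is an assumption based on the status values listed in the proposal.

```python
import time

# Assumed terminal statuses, taken from the status values proposed above.
TERMINAL = {"DECOMMISSIONED", "DECOMMISSION_FAILED"}

def wait_for_decommission(get_status, timeout_s=60.0, interval_s=1.0,
                          sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns a terminal status or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status in TERMINAL:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"still {status} after {timeout_s}s")
        sleep(interval_s)

# Usage with a fake status source and a no-op sleep, so the example runs instantly.
statuses = iter(["DECOMMISSIONING", "DECOMMISSIONING", "DECOMMISSIONED"])
result = wait_for_decommission(lambda: next(statuses), sleep=lambda s: None)
print(result)
```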

Jul 15 '22 13:07 Bukhtawar

@Bukhtawar: This makes sense. But do you think we can start without it for now and iterate on it later based on the need?

In the case where traffic hasn't been DRAINED, we would return from the API call with the reasons for the same. A user can call the APIs with a lower timeout and see the reasons for the stall. As and when we add more checks around snapshots and shard relocation, those will automatically get added to the reasons as well.

Jul 15 '22 13:07 gbbafna

Can we add labels for "roadmap" and the version of OpenSearch this is targeting? I can add it to the overall project roadmap in the right column once that is done.

Jul 20 '22 13:07 elfisher

Regarding the Put call for Decommission API

{
  "status": "PROCESSING | DRAINING | COMPLETED",
  "awareness": {
    "zone": "zone-A"
  }
}

How are we responding to failures in executing the call? (Maybe let's track FAILED as a status value.)

Jul 26 '22 14:07 saikaranam-amazon

Regarding the Get call for commission status

{
  "awareness": {
    "zone": "zone-A",
    "zone": "zone-B",
    "zone": "zone-C"
  }
}

Can we have the list of values under zones as key? - and are we not foreseeing any async operations that can be captured under the status field similar to Decommission call?

Jul 26 '22 14:07 saikaranam-amazon

How are we responding to failures in executing the call? (Maybe let's track FAILED as a status value.)

Updated the API contract above, including FAILED as a status value

Can we have the list of values under zones as key? - and are we not foreseeing any async operations that can be captured under the status field similar to Decommission call?

This makes sense to have a list under the same key. Updated the details
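
A quick way to see why a list is needed: many JSON parsers silently drop repeated keys. The snippet below uses Python's standard json module, which keeps only the last value for a duplicated name; the list-valued payload shape is just one possible form of the updated response and is an assumption here.

```python
import json

# Repeating the same key: the parser keeps only the last "zone" entry.
duplicated = '{"awareness": {"zone": "zone-A", "zone": "zone-B", "zone": "zone-C"}}'
parsed_duplicated = json.loads(duplicated)
print(parsed_duplicated)

# A list under a single key preserves every value.
as_list = '{"awareness": {"zone": ["zone-A", "zone-B", "zone-C"]}}'
parsed_list = json.loads(as_list)
print(parsed_list["awareness"]["zone"])
```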

Jul 28 '22 09:07 imRishN

Should we change _local to local in GET _cluster/shard_routing/weights?_local to keep it consistent with other _cluster/* APIs like _cluster/state?

Aug 09 '22 08:08 anshu1106

Updated the API structures above

Aug 09 '22 14:08 imRishN

@reta @dblock @elfisher any thoughts on above APIs?

Aug 10 '22 06:08 imRishN

Thanks for summarizing the API design @imRishN. I personally see a large disconnect between the existing routing awareness and the suggested decommissioning / recommissioning API.

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.awareness.attributes": "zone",
    "cluster.routing.allocation.awareness.force.zone.values":["zoneA", "zoneB"]
  }
}

Logically, it looks to me that decommissioning == remove zone from cluster.routing.allocation.awareness.force.zone.values whereas recommissioning == reintroduce zone to cluster.routing.allocation.awareness.force.zone.values.

The weights could be modeled in a similar fashion using a "cluster.routing.allocation.awareness.force.zone.weights" : [1, 0, 0] setting.

I have nothing against introducing dedicated APIs, but it is going to be difficult and confusing to maintain the API/settings split. Also, one important thing to keep in mind is that cluster settings can be persistent or transient. I believe this applies equally to decommissioning / recommissioning, for example when people rescale clusters within the same zones: the settings could be set to temporarily exclude some zones and reintroduce them after restart.

Does it make sense, or have I completely derailed the conversation?

Aug 15 '22 16:08 reta

@reta, thanks for taking a look at the RFC. The cluster settings that you mentioned above are more of a shard allocation strategy based on the awareness attribute set on the cluster.

As part of decommissioning an awareness attribute, we intend to remove the nodes from the cluster during zonal outages as it might be operating in a degraded manner and impacting the overall cluster's availability. Today, any write request requires a response from all the shard copies before the request is acknowledged. During zonal outages, this model can impact the writes to the cluster as any slow copy or impairment will slow down the writes significantly. The API design gives the user flexibility to remove the nodes present in an impacted zone out from the cluster and mark shards there as unavailable.

Logically, it looks to me that decommissioning == remove zone from cluster.routing.allocation.awareness.force.zone.values whereas recommissioning == reintroduce zone to cluster.routing.allocation.awareness.force.zone.values.

We don't need to remove the zone from the force zone values, as that might trigger a storm of shard recoveries, impacting latencies due to additional CPU and network consumption. We will let shards stay in the UNASSIGNED state after decommissioning the zone. During recovery, the user can decide to recommission the zone.

More details on recommission and decommissioning a zone can be found here #3402

Aug 17 '22 19:08 imRishN

@imRishN aha, I see, thanks for clarification, I think I have even more questions, this time regarding the API:

PUT /_cluster/decommission/awareness/<zone>/<zone-a>
DELETE /_cluster/decommission/awareness/<zone>/<zone-a>

The <zone> seems to be off here, what we need is the awareness attribute (which could be zone), so the APIs could be generalized this way:

PUT /_cluster/decommission/awareness/<attribute>/<value>
DELETE /_cluster/decommission/awareness/<attribute>/<value>

And in case of zone attribute:

PUT /_cluster/decommission/awareness/zone/<zone-a>
DELETE /_cluster/decommission/awareness/zone/<zone-a>

Regarding weights, why are we introducing shard_routing?

GET /_cluster/shard_routing/weights?local
PUT /_cluster/shard_routing/weights

The terminology we settled upon is just routing, so I think we should stick to that?

GET /_cluster/routing/weights?local
PUT /_cluster/routing/weights

An arguably even better approach is to follow the decommission/recommission APIs and design something like this:

GET /_cluster/routing/awareness/<attribute>/weights?local
PUT /_cluster/routing/awareness/<attribute>weights

WDYT?

Aug 18 '22 17:08 reta

@reta,

The <zone> seems to be off here, what we need is the awareness attribute (which could be zone), so the APIs could be generalized this way: PUT /_cluster/decommission/awareness/{attribute}/{value} DELETE /_cluster/decommission/awareness/{attribute}/{value}

That's correct. The API will take the awareness attribute set on the cluster via the setting cluster.routing.allocation.awareness.attributes and will validate it against the value of that setting. zone was just an example here. I have created a draft PR implementing the same: #4261
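
As a rough sketch of that validation, the helper below checks the requested attribute against the configured awareness attributes before building the request path. The function `decommission_path` is hypothetical and is not the code from the draft PR; it assumes the setting value is a comma-separated string such as "zone" or "zone,rack".

```python
def decommission_path(attribute, value, awareness_attributes):
    """Build the decommission path after validating the attribute (hypothetical helper).

    awareness_attributes is the value of
    cluster.routing.allocation.awareness.attributes, e.g. "zone" or "zone,rack".
    """
    configured = [a.strip() for a in awareness_attributes.split(",") if a.strip()]
    if attribute not in configured:
        raise ValueError(
            f"'{attribute}' is not a configured awareness attribute: {configured}"
        )
    return f"/_cluster/decommission/awareness/{attribute}/{value}"

path = decommission_path("zone", "zone-a", "zone")
print(path)
```

Passing an attribute that is not configured, for example "rack" when only "zone" is set, raises a ValueError instead of producing a path.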

Aug 19 '22 06:08 imRishN

@reta, thanks for the suggestions. I have updated the API path for weights as well. Let me know if this looks good to you?

Aug 22 '22 12:08 imRishN

@reta, thanks for the suggestions. I have updated the API path for weights as well. Let me know if this looks good to you?

Thanks @imRishN , it looks concise to me (minor typo with missed slash, PUT /_cluster/routing/awareness/<attribute>weights -> PUT /_cluster/routing/awareness/<attribute>/weights)

Aug 22 '22 12:08 reta

@reta, thanks for pointing out. Updated above.

Aug 22 '22 12:08 imRishN

Default looks like - {"msg":"Weights are not set"}. Shouldn't we just return an empty object {}?

Aug 25 '22 10:08 shuklas

@imRishN I am not completely sure about the URL pattern for decommission. We are decommissioning nodes from the cluster as part of this API, but that is not clear from the URL.

This is what I think it should be:

PUT /_nodes/awareness/<zone>/<zone-a>/_decommission

Aug 26 '22 06:08 sachinpkale

@sachinpkale this is another way to look at it, but I believe we aim to decommission an awareness attribute value (zone in this case), which means - everything related to this zone, including nodes (which is probably the only thing we decommission). But I see your point, it has valid concerns

Aug 26 '22 12:08 reta

Default looks like - {"msg":"Weights are not set"}. Shouldn't we just return an empty object {}?

Will make the response an empty object.

Aug 29 '22 02:08 anshu1106

@sachinpkale Although the attribute key/value is a node property, awareness in general is a cluster property. This is how we set the awareness attribute on the cluster: "cluster.routing.allocation.awareness.attributes": "zone".

Also, I feel /_nodes/awareness/<zone>/<zone-a>/_decommission will create confusion when we extend this feature to decommission only a set of nodes from a particular zone, or any other such extension. I'll also create another issue for implementing node decommission irrespective of awareness attribute. If we pursue that further, /_cluster/decommission can simply be used as the path prefix. Let me know if this clears your doubt or if you still see concerns with the above APIs.

Sep 02 '22 05:09 imRishN

Closing this issue as ALL the API PRs are merged now

Oct 17 '22 06:10 imRishN