OpenSearch
[RFC] API for decommissioning/recommissioning zone and weighted zonal search request routing policy
Is your feature request related to a problem? Please describe. #3402 aims to build support for decommissioning and recommissioning a zone based on the value assigned to the zone awareness attribute. Similarly, #2859 aims to build support for weighted zonal search requests using a weighted round robin mechanism. We need to finalise a consistent and precise API structure for both.
The scope of this issue is limited to finalising the API structure for zonal decommission/recommission and weighted zonal search requests.
Describe the solution you'd like Below are the API structures that we can use for zonal decommission/recommission and weighted zonal search requests.
Zone Decommission
PUT /_cluster/decommission/awareness/<zone>/<zone-a>
{
"acknowledged": true
}
Zone Recommission
DELETE /_cluster/decommission/awareness/<zone>/<zone-a>
{
"acknowledged": true
}
Get Zone Decommission Status
GET /_cluster/decommission/awareness/_status
{
"status": "PROCESSING | DECOMMISSIONING | DECOMMISSIONED | DECOMMISSION_FAILED | RECOMMISSIONING | RECOMMISSION_FAILED",
"awareness": {
"zone": "zone-A"
}
}
Get Zone Decommission Status For Local Node
GET _cluster/decommission/awareness/_status?local
{
"status": "PROCESSING | DECOMMISSIONING | DECOMMISSIONED | DECOMMISSION_FAILED | RECOMMISSIONING | RECOMMISSION_FAILED"
}
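A minimal sketch of how a caller might poll the status API above until the operation settles. The helper name `wait_for_terminal_status`, the injected `fetch_status` callable (standing in for a GET on the `_status` endpoint), and the choice of which status values count as terminal are all assumptions made for illustration; the RFC does not define them.

```python
from typing import Callable, Dict

# Assumption: these values from the proposed status list are treated as final.
TERMINAL_STATUSES = {"DECOMMISSIONED", "DECOMMISSION_FAILED", "RECOMMISSION_FAILED"}


def wait_for_terminal_status(fetch_status: Callable[[], Dict], max_polls: int = 10) -> str:
    """Call fetch_status() repeatedly until it reports a terminal status.

    fetch_status is a hypothetical stand-in for
    GET /_cluster/decommission/awareness/_status and must return the parsed
    JSON body as a dict with a "status" key.
    """
    for _ in range(max_polls):
        status = fetch_status()["status"]
        if status in TERMINAL_STATUSES:
            return status
    raise TimeoutError(f"no terminal status after {max_polls} polls")


# Usage with canned responses instead of a live cluster.
responses = iter([
    {"status": "PROCESSING"},
    {"status": "DECOMMISSIONING"},
    {"status": "DECOMMISSIONED", "awareness": {"zone": "zone-A"}},
])
final_status = wait_for_terminal_status(lambda: next(responses))
```

A real client would sleep between polls and issue an HTTP request inside `fetch_status`; both are left out so the sketch stays self-contained.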
Weighted Round Robin for search request
PUT _cluster/shard_routing/weights
{
"awareness" : {
"zone" : {
"zone_1": "1",
"zone_2": "1",
"zone_3": "0"
}
}
}
{
"acknowledged": true,
"awareness" : {
"zone" : {
"zone_1": "1",
"zone_2": "1",
"zone_3": "0"
}
}
}
Get weight for a local node
GET _cluster/shard_routing/weights?local
{
"weight" : 0
}
Get Weight
GET _cluster/shard_routing/weights
{
"awareness" : {
"zone" : {
"zone_1": "1",
"zone_2": "1",
"zone_3": "0"
}
}
}
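To illustrate what the weights above mean for search routing, here is a simplified weighted round robin sketch. It is not the OpenSearch implementation: it assumes integer weights passed as strings (as in the payload above), picks zones rather than shard copies, and the function name `weighted_round_robin` is hypothetical.

```python
from itertools import cycle, islice
from typing import Dict, Iterator


def weighted_round_robin(weights: Dict[str, str]) -> Iterator[str]:
    """Yield zone names forever, each zone appearing in proportion to its weight.

    A zone with weight "0" never appears, which is how traffic is weighed
    away from it.
    """
    # Repeat each zone name `weight` times, then cycle through that schedule.
    schedule = [zone for zone, weight in weights.items() for _ in range(int(weight))]
    if not schedule:
        raise ValueError("at least one zone needs a non-zero weight")
    return cycle(schedule)


zone_weights = {"zone_1": "1", "zone_2": "1", "zone_3": "0"}
first_four = list(islice(weighted_round_robin(zone_weights), 4))
# first_four == ["zone_1", "zone_2", "zone_1", "zone_2"]
```

With the example weights, requests alternate between zone_1 and zone_2 and zone_3 receives none.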
The PUT /_cluster/decommission/awareness/<zone>/<zoneA> call would first modify the weights to weigh traffic away from the given zone, and would then check whether the weighed-away zone is still receiving HTTP or search traffic. If it is, the status moves to DRAINING; once the incoming HTTP and search traffic has drained, the decommission is executed.
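The sequence described above (weigh away, check for traffic, DRAINING, then decommission) can be sketched as a small state walk. This is only an illustration under stated assumptions: `has_traffic` is a hypothetical probe for remaining HTTP/search traffic, and giving up with DECOMMISSION_FAILED after `max_checks` probes is an assumption, not something the RFC specifies.

```python
from typing import Callable, Dict, List


def decommission_zone(
    weights: Dict[str, int],
    zone: str,
    has_traffic: Callable[[str], bool],
    max_checks: int = 5,
) -> List[str]:
    """Return the statuses the flow passes through while decommissioning `zone`."""
    statuses: List[str] = []
    weights[zone] = 0  # step 1: weigh search traffic away from the zone
    checks = 0
    while has_traffic(zone):  # step 2: is the zone still receiving traffic?
        if not statuses:
            statuses.append("DRAINING")
        checks += 1
        if checks >= max_checks:  # assumption: bounded wait, then fail
            statuses.append("DECOMMISSION_FAILED")
            return statuses
    statuses.append("DECOMMISSIONED")  # step 3: traffic drained, decommission
    return statuses


# Usage: traffic is still seen on the first two probes, then it is gone.
probe_results = iter([True, True, False])
zone_weights = {"zone_1": 1, "zone_2": 1, "zone_3": 1}
flow = decommission_zone(zone_weights, "zone_3", lambda z: next(probe_results))
```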
@imRishN Could we please add req(/res) for obtaining current status of Recommission/Decommission as well?
@saikaranam-amazon updated the req/res for both update/get calls for recommission/decommission
Thanks @imRishN
- Can we store additional information to reflect the updated-at timestamp for the changes? (This can be part of the response payload for all the GET calls.)
@saikaranam-amazon could you help elaborate on why that might be needed?
As we have two APIs, one to update the weights of the search traffic and one to decommission an entire zone (without any additional checks or wait time), the latter operation might cause instability for some in-flight operations.
With an updated-at field, the customer can make a conscious decision (to decommission the zone) based on the timestamp of the current weights for search traffic, and the field can further help with automation as well.
- Also, do all the update APIs support conditional updates based on the current weights passed in the payload and the values in the current cluster state (for de/recommission)?
- And in the absence of a recommission/decommission state, will the APIs return an empty response? Could we instead have the decommission state (bool) as part of the payload itself for the GET calls?
@saikaranam-amazon could you help elaborate on why that might be needed?
sure @Bukhtawar update above.
Thanks @imRishN .
Should we make the Zone Decommission API synchronous? The Get commission status API will essentially output the state preserved in the cluster state. By making the API synchronous, the need for the status API can be avoided.
If users want to know the status of the decommission, _cat/nodes, _cat/master and _cluster/health should be able to provide the granular details.
When we decommission the zone, it is possible that traffic hasn't been DRAINED, in which case it might take longer and calls could time out. The GET API can perform more exhaustive checks on traffic drain, and there could be more graceful checks in future around ongoing snapshots, shard relocation etc., which would take time to complete. Having a GET API would help extend this to other cases as well.
@Bukhtawar: This makes sense. But do you think we can start without it for now and iterate on it later based on the need?
In the case where traffic hasn't been DRAINED, we would return from the API call with the reasons for that. A user can call the APIs with a lower timeout and see the reasons for the call getting stalled. As and when we add more checks around snapshots or shard relocation, those will automatically get added to the reasons as well.
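A rough sketch of the synchronous variant suggested here: the call blocks until every check passes or the timeout expires, and on timeout it reports which checks are still pending. The function name, the check names, and the `reasons` field in the response are hypothetical; the thread does not define a response shape for this case.

```python
import time
from typing import Callable, Dict


def decommission_with_timeout(
    checks: Dict[str, Callable[[], bool]],
    timeout_s: float,
    poll_interval_s: float = 0.01,
) -> Dict:
    """Block until all checks pass or the timeout expires.

    Each entry in `checks` maps a reason name to a callable that returns
    True once that precondition (for example, traffic drained) is satisfied.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        pending = [name for name, passed in checks.items() if not passed()]
        if not pending:
            return {"acknowledged": True, "reasons": []}
        if time.monotonic() >= deadline:
            # Timed out: tell the caller what the call was stalled on.
            return {"acknowledged": False, "reasons": pending}
        time.sleep(poll_interval_s)


stalled = decommission_with_timeout(
    {"traffic_drained": lambda: False, "no_ongoing_snapshots": lambda: True},
    timeout_s=0.05,
)
```

New checks (snapshots, shard relocation) would only need a new entry in `checks` to show up in the reported reasons.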
Can we add labels for "roadmap" and the version of OpenSearch this is targeting? I can add it to the overall project roadmap in the right column once that is done.
Regarding the Put call for Decommission API
{
"status": "PROCESSING | DRAINING | COMPLETED",
"awareness": {
"zone": "zone-A"
}
}
How are we responding to failures in executing the call? (Maybe let's track FAILED as a status value.)
Regarding the Get call for commission status
{
"awareness": {
"zone": "zone-A",
"zone": "zone-B",
"zone": "zone-C"
}
}
Can we have the list of values under zones as key? And are we not foreseeing any async operations that can be captured under the status field, similar to the Decommission call?
How are we responding to failures in executing the call? (Maybe let's track FAILED as a status value.)
Updated the API contract above, including FAILED as a status value
Can we have the list of values under zones as key? And are we not foreseeing any async operations that can be captured under the status field, similar to the Decommission call?
This makes sense to have a list under the same key. Updated the details
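A short illustration of why the repeated "zone" key is a problem: most JSON parsers keep only one value per key, so two of the three zones are silently lost. The list-shaped alternative below, including the key name "zones", is only an assumed reading of the suggestion, not the final contract.

```python
import json

# The response as originally written: the same key repeated three times.
repeated_keys = '{"awareness": {"zone": "zone-A", "zone": "zone-B", "zone": "zone-C"}}'
parsed = json.loads(repeated_keys)
# Python's json module keeps the last value for a duplicated key.
surviving = parsed["awareness"]["zone"]

# Assumed alternative: one key holding a list of values.
list_shape = '{"awareness": {"zones": ["zone-A", "zone-B", "zone-C"]}}'
zones = json.loads(list_shape)["awareness"]["zones"]
```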
Should we change _local to local in GET _cluster/shard_routing/weights?_local to keep it consistent with other _cluster/* APIs like _cluster/state?
Updated the API structures above
@reta @dblock @elfisher any thoughts on above APIs?
Thanks for summarizing the API design @imRishN. I personally see a large disconnect between the existing routing awareness and the suggested decommissioning / recommissioning API.
PUT _cluster/settings
{
"persistent": {
"cluster.routing.allocation.awareness.attributes": "zone",
"cluster.routing.allocation.awareness.force.zone.values":["zoneA", "zoneB"]
}
}
Logically, it looks to me that decommissioning == remove zone from cluster.routing.allocation.awareness.force.zone.values whereas recommissioning == reintroduce zone to cluster.routing.allocation.awareness.force.zone.values.
The weights could be modeled in a similar fashion using a "cluster.routing.allocation.awareness.force.zone.weights" : [1, 0, 0] setting.
I have nothing against introducing dedicated APIs, but it is going to be difficult and confusing to maintain the API/settings split. Also, one important thing to keep in mind is that cluster settings can be persistent or transient. I believe this applies equally to decommissioning / recommissioning: for example, when people rescale clusters within the same zones, the settings could be set to temporarily exclude some zones and reintroduce them after restart.
Does this make sense, or have I completely derailed the conversation?
@reta, thanks for taking a look at the RFC. The cluster settings that you mentioned above are more of a shard allocation strategy based on the awareness attribute set on the cluster.
As part of decommissioning an awareness attribute, we intend to remove the nodes from the cluster during zonal outages, as the zone might be operating in a degraded manner and impacting the overall cluster's availability. Today, any write request requires a response from all the shard copies before the request is acknowledged. During zonal outages, this model can impact writes to the cluster, as any slow copy or impairment will slow down the writes significantly. The API design gives the user the flexibility to remove the nodes in an impacted zone from the cluster and mark the shards there as unavailable.
Logically, it looks to me that decommissioning == remove zone from cluster.routing.allocation.awareness.force.zone.values whereas recommissioning == reintroduce zone to cluster.routing.allocation.awareness.force.zone.values.
We don't need to remove the zone from the force zone values, as that might trigger a storm of shard recoveries, impacting latencies due to additional CPU and network consumption. We will let the shards stay in the UNASSIGNED state after decommissioning the zone. During recovery, the user can decide on recommissioning the zone again.
More details on recommission and decommissioning a zone can be found here #3402
@imRishN aha, I see, thanks for clarification, I think I have even more questions, this time regarding the API:
PUT /_cluster/decommission/awareness/<zone>/<zone-a>
DELETE /_cluster/decommission/awareness/<zone>/<zone-a>
The <zone> seems to be off here; what we need is the awareness attribute (which could be zone), so the APIs could be generalized this way:
PUT /_cluster/decommission/awareness/<attribute>/<value>
DELETE /_cluster/decommission/awareness/<attribute>/<value>
And in the case of the zone attribute:
PUT /_cluster/decommission/awareness/zone/<zone-a>
DELETE /_cluster/decommission/awareness/zone/<zone-a>
Regarding weights, why are we introducing shard_routing?
GET /_cluster/shard_routing/weights?local
PUT /_cluster/shard_routing/weights
The terminology we settled upon is just routing, so I think we should stick to that?
GET /_cluster/routing/weights?local
PUT /_cluster/routing/weights
An even better (arguably) approach is to follow the decommission/recommission pattern and design something like this:
GET /_cluster/routing/awareness/<attribute>/weights?local
PUT /_cluster/routing/awareness/<attribute>/weights
WDYT?
@reta,
The <zone> seems to be off here, what we need is the awareness attribute (which could be zone), so the APIs could be generalized this way: PUT /_cluster/decommission/awareness/{attribute}/{value} DELETE /_cluster/decommission/awareness/{attribute}/{value}
That's correct. The API will take in the awareness attribute set on the cluster by the setting cluster.routing.allocation.awareness.attributes and will validate it against the value this setting has. zone was an example here. I have created a draft PR implementing the same: #4261
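The validation described here can be sketched roughly as follows. This is not the code from the draft PR: the helper name `build_decommission_request` is hypothetical, and the sketch assumes the awareness setting holds a comma-separated list of attribute names.

```python
from typing import Dict, Tuple

AWARENESS_SETTING = "cluster.routing.allocation.awareness.attributes"


def build_decommission_request(
    cluster_settings: Dict[str, str], attribute: str, value: str
) -> Tuple[str, str]:
    """Return (method, path) for a decommission call, rejecting unknown attributes."""
    # Assumption: the setting is a comma-separated string such as "zone" or "zone,rack".
    configured = [
        name.strip()
        for name in cluster_settings.get(AWARENESS_SETTING, "").split(",")
        if name.strip()
    ]
    if attribute not in configured:
        raise ValueError(
            f"'{attribute}' is not an awareness attribute of this cluster: {configured}"
        )
    return ("PUT", f"/_cluster/decommission/awareness/{attribute}/{value}")


settings = {AWARENESS_SETTING: "zone"}
request = build_decommission_request(settings, "zone", "zone-a")
# request == ("PUT", "/_cluster/decommission/awareness/zone/zone-a")
```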
@reta, thanks for the suggestions. I have updated the API path for weights as well. Let me know if this looks good to you?
Thanks @imRishN, it looks concise to me (minor typo with a missed slash: PUT /_cluster/routing/awareness/<attribute>weights -> PUT /_cluster/routing/awareness/<attribute>/weights)
@reta, thanks for pointing out. Updated above.
Default looks like - {"msg":"Weights are not set"}. Shouldn't we just return an empty object {}?
@imRishN I am not completely sure about the URL pattern for decommission. We are decommissioning nodes from the cluster as part of this API, but that is not clear from the URL.
This is what I think it should be:
PUT /_nodes/awareness/<zone>/<zone-a>/_decommission
@sachinpkale this is another way to look at it, but I believe we aim to decommission an awareness attribute value (zone in this case), which means everything related to this zone, including nodes (which is probably the only thing we decommission). But I see your point; these are valid concerns.
Default looks like - {"msg":"Weights are not set"}. Shouldn't we just return an empty object {}?
Will make the response an empty object
@sachinpkale Although the attribute key/value is a node property, awareness in general is a cluster property. This is how we set the awareness attribute on the cluster: "cluster.routing.allocation.awareness.attributes": "zone".
Also, I feel /_nodes/awareness/<zone>/<zone-a>/_decommission will create confusion when we extend this feature to decommission only a set of nodes from a particular zone, or for any similar extension. I'll also create another issue for implementing node decommission irrespective of awareness attribute. If we pursue it further, /_cluster/decommission can simply be used as the path prefix. Let me know if this clears your doubt, or whether you still see concerns with the above APIs.
Closing this issue as ALL the API PRs are merged now