performance-analyzer-rca

[FEATURE] Add support to recommend threshold tuning for heap based task cancellations by SearchBackpressureService

Open kaushalmahi12 opened this issue 1 year ago • 1 comments

Is your feature request related to a problem?

OpenSearch recently introduced a feature called search backpressure, which makes the service more resilient to node drops and performance degradation. It does so by cancelling resource-intensive search queries at the shard level and the coordinator node level. Several settings control when a search query is cancelled, based on the resource the query uses heavily. As part of this feature request, we want to add support for recommending threshold tuning of the settings that govern heap-based query cancellation at the shard and coordinator levels.

What solution would you like?

There are multiple settings for each resource-based cancellation. We will recommend only a single value (a multiplier) by which the thresholds for a resource (in this case heap) should increase or decrease, because recommending a value per setting would complicate the solution and increase the number of RCAs we need to create. We will emit separate actions for the search task (coordinator) level and the shard level.

Logic to mark the RCA unhealthy to increase the thresholds (Node level)

  • If the max heap used by the OpenSearch process is below 85% for a minute. Since the RCA runs at 5-second intervals, we will keep a sliding window of heapUsed values covering one minute.
  • And heap-based task cancellations are more than 3%. (Rate limiters cap the number of cancellations: no more than 10% of all successful tasks can be cancelled, at both the shard level and the coordinator level.)

Logic to mark the RCA unhealthy to decrease the thresholds (Node level)

  • If the max heap used by the OpenSearch process is above 90% for a minute. Since the RCA runs at 5-second intervals, we will keep a sliding window of heapUsed values covering one minute.
  • And heap-based task cancellations are less than 3%. (Rate limiters cap the number of cancellations: no more than 10% of all successful tasks can be cancelled, at both the shard level and the coordinator level.)
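The two node-level rules above could be sketched roughly as follows. This is a minimal illustration, not the RCA framework's actual code: the class name `HeapCancellationWindow`, the `recommend` method, and the string results are hypothetical. It also assumes one reading of "for a minute", namely that every sample in the one-minute window must satisfy the heap condition, and it takes the cancellation percentage as an input computed elsewhere.

```python
from collections import deque

SAMPLE_INTERVAL_S = 5                          # RCA evaluation interval (from the issue)
WINDOW_S = 60                                  # sliding window length (from the issue)
WINDOW_SIZE = WINDOW_S // SAMPLE_INTERVAL_S    # 12 samples per window

HEAP_LOW_PCT = 85.0        # below this for a minute: candidate for increasing thresholds
HEAP_HIGH_PCT = 90.0       # above this for a minute: candidate for decreasing thresholds
CANCELLATION_PCT = 3.0     # heap-based cancellation percentage boundary


class HeapCancellationWindow:
    """Hypothetical helper that keeps one minute of heap-used samples for a node."""

    def __init__(self, size=WINDOW_SIZE):
        self.samples = deque(maxlen=size)

    def add_sample(self, heap_used_pct):
        # Oldest sample is dropped automatically once the window is full.
        self.samples.append(heap_used_pct)

    def recommend(self, cancellation_pct):
        """Return "increase", "decrease", or "none" for the heap thresholds."""
        if len(self.samples) < self.samples.maxlen:
            return "none"  # not enough data for a full minute yet
        # Assumption: "below 85% for a minute" means the window maximum is below 85%.
        if max(self.samples) < HEAP_LOW_PCT and cancellation_pct > CANCELLATION_PCT:
            return "increase"
        # Assumption: "above 90% for a minute" means the window minimum is above 90%.
        if min(self.samples) > HEAP_HIGH_PCT and cancellation_pct < CANCELLATION_PCT:
            return "decrease"
        return "none"
```

A caller would feed one `add_sample` call per RCA tick and evaluate `recommend` with the current heap-based cancellation percentage for either the coordinator or the shard level.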

Marking the cluster level RCAs unhealthy

We will mark the cluster-level RCA as unhealthy if any node in the cluster has had an unhealthy node-level RCA for an hour, with a cool-off period of one day.
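One way to picture the cluster-level rule is the sketch below. The class `ClusterRcaState` and its `update` method are hypothetical names, and two points are assumptions rather than statements from the issue: "for an hour" is read as continuously unhealthy for an hour, and the cool-off is read as "do not flag the cluster again within one day of the last flag".

```python
UNHEALTHY_FOR_S = 60 * 60       # a node must stay unhealthy for one hour
COOL_OFF_S = 24 * 60 * 60       # cool-off of one day after flagging the cluster


class ClusterRcaState:
    """Hypothetical tracker for the cluster-level RCA decision."""

    def __init__(self):
        self.unhealthy_since = {}    # node id -> time the node became unhealthy
        self.last_flagged_at = None  # time the cluster was last marked unhealthy

    def update(self, node_health, now_s):
        """node_health maps node id -> True if its node-level RCA is unhealthy.

        Returns True if the cluster-level RCA is marked unhealthy on this call.
        """
        for node_id, unhealthy in node_health.items():
            if unhealthy:
                # Remember only the start of the current unhealthy streak.
                self.unhealthy_since.setdefault(node_id, now_s)
            else:
                # A healthy report resets the streak for that node.
                self.unhealthy_since.pop(node_id, None)

        in_cool_off = (
            self.last_flagged_at is not None
            and now_s - self.last_flagged_at < COOL_OFF_S
        )
        if in_cool_off:
            return False

        sustained = any(
            now_s - since >= UNHEALTHY_FOR_S
            for since in self.unhealthy_since.values()
        )
        if sustained:
            self.last_flagged_at = now_s
        return sustained
```

Timestamps are plain seconds here so the logic stays easy to follow; a real implementation would likely use whatever clock abstraction the RCA framework already provides.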

Adjusted SBP Settings

  • search_backpressure.search_task.total_heap_percent_threshold
  • search_backpressure.search_task.heap_percent_threshold
  • search_backpressure.search_task.heap_variance
  • search_backpressure.search_task.heap_moving_average_window_size
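As an illustration of the single-multiplier idea, the sketch below scales the heap thresholds in a settings map. The issue does not say how the multiplier maps onto each setting, so the following are assumptions: the two `*_percent_threshold` settings are treated as fractions capped at 1.0, `heap_variance` is scaled as-is, and the moving average window size is left untouched because it is a sample count rather than a threshold. The function name `scale_thresholds` and the sample values are hypothetical.

```python
# Settings that the multiplier is applied to in this sketch (an assumption).
SCALED_SETTINGS = (
    "search_backpressure.search_task.total_heap_percent_threshold",
    "search_backpressure.search_task.heap_percent_threshold",
    "search_backpressure.search_task.heap_variance",
)


def scale_thresholds(current, multiplier):
    """Return a copy of `current` with the heap thresholds scaled by `multiplier`."""
    if multiplier <= 0:
        raise ValueError("multiplier must be positive")
    updated = dict(current)
    for name in SCALED_SETTINGS:
        if name not in updated:
            continue
        value = updated[name] * multiplier
        if name.endswith("percent_threshold"):
            # Assumption: percent thresholds are fractions, so cap them at 1.0.
            value = min(value, 1.0)
        updated[name] = value
    return updated


# Illustrative values only; they are not claimed to be the OpenSearch defaults.
example = {
    "search_backpressure.search_task.total_heap_percent_threshold": 0.05,
    "search_backpressure.search_task.heap_percent_threshold": 0.02,
    "search_backpressure.search_task.heap_variance": 2.0,
    "search_backpressure.search_task.heap_moving_average_window_size": 10,
}
recommended = scale_thresholds(example, 2.0)
```

A multiplier above 1 loosens the thresholds (fewer cancellations) and a multiplier below 1 tightens them, which matches the increase/decrease recommendations described earlier.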

What alternatives have you considered?

The RCA framework is already in place; it runs as a sidecar and does not share the OpenSearch process's resources. The alternative would have been to place this logic in OpenSearch itself, but that could cause resource scarcity and degrade the performance of an OpenSearch process that is already under duress.

Do you have any additional context?

kaushalmahi12, Jul 14 '23 17:07