[RFC] Search Query Sandboxing: User Experience
Co-Author @kkhatua
- #8879 - [RFC] High Level Vision for Core Search in OpenSearch
- #11061 - [RFC] Query Sandboxing for search requests
A common challenge in managing resources on OpenSearch clusters has been keeping runaway queries in check. With Search Backpressure, the ability to avoid running out of resources is now available, but there is no capability to protect tenants who might be unfairly penalized. The goal of this RFC is to propose a mechanism by which admin users can organize tenants into different groups (aka Sandboxes) and limit the cumulative resource utilization of these groups. We will outline an idea of how a sandbox is enforced, but enforcement will likely be a separate RFC due to its complexity.
What is a Sandbox?
A sandbox is a logical construct designed for running search requests within virtual limits. The sandboxing service will track and aggregate a node's resource usage for the different sandboxes and, based on the mode of containment, limit or terminate tasks of a sandbox that exceeds its limits. A sandbox's definition is stored in cluster state, so the limits to be enforced are available to all nodes across the cluster.
Tenets
- A sandbox can constrain one or more resources.
- A user may hint a preference for a sandbox.
- A request can map to only one sandbox.
- A request that matches multiple sandboxes will be assigned to the sandbox with the highest similarity score.
- A request that matches no sandbox, or multiple sandboxes with the same similarity score, will be assigned to a system-defined catch-all sandbox called genPop (general population).
- The matching attributes of a sandbox must be unique compared to other sandboxes.
- The sandbox thresholds for a resource cumulatively cannot exceed 100%.
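To make the resolution tenets concrete, here is a minimal Python sketch. It is illustrative only; the function name, data shapes, and exact similarity scoring are assumptions, not the actual OpenSearch implementation. It scores sandboxes by matched attributes and falls back to genPop on no match or a tie:

```python
def resolve_sandbox(request_attrs, sandboxes):
    """Map a request to exactly one sandbox per the tenets.

    request_attrs: dict of attribute name -> value for the request.
    sandboxes: list of dicts, each with a "name" and an "attributes" dict.
    Returns the name of the winning sandbox, or "genPop" on no match or a tie.
    """
    scored = []
    for sb in sandboxes:
        # Similarity score: number of sandbox attributes matched by the request.
        score = sum(
            1
            for attr, expected in sb["attributes"].items()
            if request_attrs.get(attr) == expected
        )
        # A sandbox only qualifies if every one of its attributes matched.
        if score > 0 and score == len(sb["attributes"]):
            scored.append((score, sb["name"]))

    if not scored:
        return "genPop"  # tenet: no match falls through to general population
    scored.sort(reverse=True)
    # tenet: a tie on the highest similarity score also falls through to genPop
    if len(scored) > 1 and scored[0][0] == scored[1][0]:
        return "genPop"
    return scored[0][1]
```

For example, a request carrying two attributes would land in the sandbox that matches both of them rather than one that matches only one, since the former has the higher similarity score.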
Schemas
Sandbox Definition
The following is an abstract example of a sandbox’s definition, broadly broken into 4 essential elements:
```json
{
  "name": "<nameOfSandbox>",
  "attributes": {
    "attributeA": "<condition1,condition2>",
    ...
  },
  "resources": {
    "resourceX": {
      "allocation": "<floatValue:(0-1)>",
      "threshold": "<floatValue:(0-1)>"
    },
    ...
  },
  "enforcement": "<monitor|soft|hard>"
}
```
- name : a unique String identifier. Internally, a UUID is used.
- attributes : a collection of unique attributes that a request must match to be governed by that sandbox. We are considering user_name, indices_name, role, and custom_id.
- resources : the set of unique resources for which this sandbox is being defined. We are considering jvm and cpu.
- enforcement : how the sandbox is enforced. This can be monitor, soft, or hard.
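As a sketch of how the cross-sandbox tenets could be validated when a definition is stored (the helper name and error handling are hypothetical, not part of this proposal's API), the following checks that per-resource allocations cumulatively stay within 100% and that no two sandboxes share an identical attribute set:

```python
def validate_sandboxes(sandboxes):
    """Check the tenets that span the whole set of sandbox definitions:
    per-resource allocations must cumulatively stay within 100%, and no two
    sandboxes may share an identical attribute set."""
    totals = {}
    seen_attrs = []
    for sb in sandboxes:
        # Tenet: the matching attributes must be unique across sandboxes.
        attrs = tuple(sorted(sb["attributes"].items()))
        if attrs in seen_attrs:
            raise ValueError(f"duplicate attributes in sandbox {sb['name']!r}")
        seen_attrs.append(attrs)
        # Tenet: allocations for a resource cannot cumulatively exceed 100%.
        for resource, limits in sb["resources"].items():
            totals[resource] = totals.get(resource, 0.0) + float(limits["allocation"])
            if totals[resource] > 1.0:
                raise ValueError(f"allocations for {resource!r} exceed 100%")
    return totals
```

The remainder after summing all allocations for a resource is implicitly what the genPop sandbox operates in.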
Resource Definition
For each resource, a cluster-level schema is also required. The following is an abstract example:
```json
{
  "resources": {
    "resourceX": {
      "threshold": {
        "rejection": "<floatValue:(0-1)>",
        "cancellation": "<floatValue:(0-1)>"
      },
      "enforcement": "<monitor|soft|hard>"
    },
    ...
  }
}
```
- resources : the limits of a resource within which all sandboxes must cumulatively operate before requests get rejected or cancelled. Note that rejectionThreshold <= cancellationThreshold.
- enforcement : how the containment of sandboxes for this resource will be managed. This can be monitor, soft, or hard, and can potentially override a sandbox’s enforcement.
High Level Flow
A request landing on a coordinator node will first be mapped to a sandbox, per the tenets. Once a sandbox has been mapped, all child tasks spawned from the request inherit the sandbox allocation, irrespective of which node they run on.
Sandbox Resolution
The sandbox resolution happens on the coordinator node and will persist with all the tasks (and child tasks) of that request for the entirety of its lifecycle.
Thresholds Enforcement
As the sandboxes are enforced at a node level, for each resource, there are 2 thresholds defined cluster-wide:
- rejection — Reject new tasks if this threshold is breached
- cancellation — Cancel inflight tasks if this threshold is breached
Sandbox-level thresholds are always proportional to the node-level thresholds.
The following is an example where:
- Two sandboxes have 20% and 25% thresholds for a resource, and the rest (100% - 20% - 25% = 55%) belongs to the general-population (genPop) sandbox.
- If a node exceeds 85% usage for a resource, it will actively reject new tasks destined for the overflowing sandboxes.
- If resource usage exceeds 95%, the node will actively cancel inflight tasks of the overflowing sandboxes.
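The enforcement decision described above can be sketched as follows. The 85%/95% values mirror the example; the function name and shape are assumptions rather than the proposed implementation:

```python
# Node-wide thresholds from the example above (cluster-configurable in the RFC).
NODE_REJECTION_THRESHOLD = 0.85
NODE_CANCELLATION_THRESHOLD = 0.95

def enforcement_actions(node_usage, sandbox_usage, sandbox_allocation):
    """Decide per-sandbox actions for one resource on one node.

    node_usage: total usage of the resource on this node (0-1).
    sandbox_usage / sandbox_allocation: dicts keyed by sandbox name (0-1).
    Returns {sandbox: "reject" | "cancel" | "none"}.
    """
    actions = {}
    for name, used in sandbox_usage.items():
        # Only sandboxes exceeding their own proportional share are targeted.
        overflowing = used > sandbox_allocation[name]
        if node_usage > NODE_CANCELLATION_THRESHOLD and overflowing:
            actions[name] = "cancel"   # cancel inflight tasks
        elif node_usage > NODE_REJECTION_THRESHOLD and overflowing:
            actions[name] = "reject"   # reject new tasks
        else:
            actions[name] = "none"
    return actions
```

With the 20%/25% sandboxes from the example, a node at 90% usage would reject new tasks only for whichever sandbox is over its own share, leaving the well-behaved sandbox untouched.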
Additional Context
This RFC aims to discuss ideas at a high level. More details are provided in this Google doc. Anyone with the link has comment access, and we would love to gather feedback.
@msfroh @dblock @reta @sohami @Bukhtawar @nknize would love your thoughts !
Could we piggy-back on the idea of "views" to enforce sandboxes? A view could be associated with a sandbox.
Alternatively, if we wanted to be more flexible, a view could enforce a search pipeline, then the search pipeline could (conditionally) modify the request to associate it with a sandbox.
My next few messages contain notes from Search Backlog and Triage meeting: https://www.meetup.com/opensearch/events/298411954/. The @ references refer to the people who said the respective lines.
- @kkhatua: This is a feature aimed at sysadmins. We want to allow them to segregate their user classes and ration resources per class.
- @kiranprakash154: Enforcement has three levels: Monitor, soft, and hard enforcement. Monitor just gives you stats. Soft lets queries in until the node is under duress. Hard limit starts canceling queries when the sandbox is exceeded, whether the node is in duress or not.
Example from the Google doc that helps clarify things more than the example above:
Two Sandboxes defined for memory and CPU
```json
[
  {
    "name": "analysts",
    "attributes": {
      "role": "ba-*"
    },
    "resources": {
      "jvm": {
        "allocation": "0.5"
      },
      "cpu": {
        "allocation": "0.3"
      }
    },
    "enforcement": "soft"
  },
  {
    "name": "operations",
    "attributes": {
      "role": "tellers",
      "indices_name": "transactions-*"
    },
    "resources": {
      "jvm": {
        "allocation": "0.25"
      }
    },
    "enforcement": "hard"
  }
]
```
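Assuming wildcard semantics for attribute values like ba-* and transactions-* (an assumption on my part; the RFC does not spell out the matching rules), the two definitions above could match requests roughly like this:

```python
from fnmatch import fnmatch

def matches(sandbox, request_attrs):
    """True if every attribute pattern of the sandbox matches the request."""
    return all(
        attr in request_attrs and fnmatch(request_attrs[attr], pattern)
        for attr, pattern in sandbox["attributes"].items()
    )

# The two sandboxes from the example, reduced to their matching attributes.
analysts = {"name": "analysts", "attributes": {"role": "ba-*"}}
operations = {
    "name": "operations",
    "attributes": {"role": "tellers", "indices_name": "transactions-*"},
}

# A business analyst's query lands in "analysts"; a teller querying the
# transactions indices lands in "operations"; anything else falls to genPop.
```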
So I think the usage of the sandbox concept is confusing here: we are not sandboxing anything (we cannot limit resources) but rather track resource usage and react to it. Better names would be "resource limits group", "query prioritization group", or something along these lines.
Linking #1017 as it seems somewhat related
- Stavros: Is this here to prevent long-running queries? What if I have a long-running query and want to run it without impacting others. Can I suspend the query?
- @kkhatua -- Not really. Existing search back pressure will cancel the oldest query if the system is under duress. If another user comes in and spams a lot of queries, we may want to stop those queries and let the long-running query finish.
- @ansjcy: What do you mean by "usage"? Queries span multiple nodes. Which node do we measure usage from?
- @kiranprakash154: We assign a query ID when it comes in and track the query across all nodes.
- @jainankitk: We don't need to track resource usage across all nodes. If we have a 10% limit on CPU usage, we enforce that on every node.
- @reta: Sandboxing probably isn't the best name. It's more around limits and monitoring. @getsaurabh02: Should probably refer to it as query prioritization.
- @kiranprakash154: We map requests to sandboxes based on attributes. If a query maps to more than one sandbox, we map it to a default sandbox.
- @msfroh: Putting sandboxing preference in the hands of users seems unnecessarily complex. Only sysadmins should need to worry about resource utilization.
- @peternied: Users didn't want the rules in place. Operators want the rules in place.
- @kkhatua: What if users want to run a load test or something?
- @macrakis: This seems like a very low level mechanism. If I'm a sysadmin, I would like to guarantee some performance characteristics, like some queuing system. Classic operating system behavior. How are users supposed to know when they can run their long-running queries effectively?
- @peternied: Do we want to separate individual users?
- @macrakis: This is a bottom-up mechanism. We need to think about this top-down. How are users going to manage long-running queries? We don't have a way of putting queries to sleep, but maybe we need to think about how to do that in future. This feels like a very implementation-oriented approach, rather than a functionality-oriented approach.
- @kkhatua: Example usage is folks who change their dashboard to refresh every 30 seconds, down from 5 minutes.
- @macrakis: We can do that by limiting users to different levels. Allocate resources equitably between users. Each of 5 users should be treated equally well. What does equally mean? We can debate that. Maybe the 5th user runs a long-running query that can wait. We try to give them equal access to the system.
- @kkhatua: This is focused on containing users who can impact the system. Big users don't seem to be concerned with fair allocation of resources, but containing users who are overusing the system.
- @kkhatua: Expectation is that sysadmins will first run in monitoring mode, and then set soft/hard limits based on the observed steady state.
Thanks for the RFC @kiranprakash154 and notes @msfroh . Regarding the term "Sandbox" used in the proposal, I'd like to offer another perspective. The term might not fully capture the essence of our approach, as we are not actually pausing or decelerating query execution to manage system load, which is often associated with a sandbox environment. Instead, we're implementing a more granular strategy on the lines of query-level circuit breakers for various resources, such as CPU and memory. This method ensures more precise control over resource allocation.
Additionally, incorporating a priority system for query execution, along with the capability to cancel queries when resource usage exceeds certain thresholds, seems to align with the principles outlined in this proposal.
Thanks for writing this all up and taking the time to present the document today. The level of detail and consideration for low level scenarios shows through. In the context of advancing this proposal - I would recommend creating a proof of concept and publishing it as a draft pull request. This approach will not only help us visualize the proposed mechanisms but also enable a hands-on exploration of scenarios that significantly impact users.
As we move forward, I would suggest probing into the following areas:
- How much effort and understanding does a sysadmin need to enable and manage a sandbox? A poorly tuned threshold can make the cluster unavailable; it's important to experiment with how intuitive this process is, especially with dynamically changing workloads.
- Clusters evolve over time by scaling horizontally (additional nodes) or vertically (larger instances); how does changing these constraints impact sandboxing? What needs to be built so that sandboxing works without manual intervention after a scaling event?
- There is overhead in additional monitoring systems and in keeping 'head room' in thresholds, which directly impacts billing. How can we make sure the cluster is well tuned to be responsive without underutilizing resources?
- When an enforcement action is performed (queries are rejected or canceled), the feedback loop for the query author is crucial. How are users informed of these actions, and what guidance are they given to help them adjust their queries?
- Specifically in OpenSearch Dashboards, how these enforcement actions are communicated to the end-user is key to a smooth user experience. What methods will be used to handle these actions gracefully and avoid user confusion?
@peternied - Thank you for reviewing the proposal and providing detailed feedback. Please find responses to some of the questions below:
> Clusters evolve over time by scaling horizontally (additional nodes) or vertically (larger instances), how does changing these constraints impact sandboxing. What needs to be built so that after a scaling event sandboxing works without manual intervention?
Scaling the cluster horizontally or vertically will not impact sandboxing, as the constraints are defined in a way that is agnostic to such events.
> There is overhead with additional monitoring systems and keeping 'head room' in thresholds that directly impacting billing. How can we make sure that the cluster is well tuned to be responsive without underutilization of resources?
Underutilization is an interesting aspect and something we discussed at length while drafting the proposal. Every sandbox has the option of a soft enforcement mode that allows it to exceed its allocated quota as long as the node is not under duress. If the node is under duress, the sandbox will be hard-enforced to its pre-allocated quota, since we want to prevent the node from going down at any cost.
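The soft/hard behaviour described here can be sketched as a small admission check (illustrative only; the function name and signature are hypothetical):

```python
def admit_task(enforcement, sandbox_usage, allocation, node_in_duress):
    """Return True if a new task should be admitted for this sandbox.

    enforcement: "monitor" | "soft" | "hard" mode of the sandbox.
    sandbox_usage / allocation: current usage and quota for one resource (0-1).
    node_in_duress: whether the node as a whole is resource-constrained.
    """
    within_quota = sandbox_usage < allocation
    if enforcement == "monitor":
        return True                      # only record stats, never block
    if enforcement == "hard":
        return within_quota              # always bounded by the allocation
    # soft: opportunistically use spare capacity while the node is healthy,
    # but clamp back to the allocation once the node is under duress
    return within_quota or not node_in_duress
```

This is how soft enforcement avoids underutilization: spare node capacity is usable until duress, at which point the sandbox behaves like a hard-enforced one.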
> When an enforcement action is performed (queries are rejected or canceled), the feedback loop for the query author is crucial. How are users informed of these actions and what guidance are they provided to help them adjust their queries?
While we will not provide immediate guidance on adjusting queries, the error message should clearly inform users of the reason for rejection. In later phases, we can leverage query insights to provide actionable feedback on making a query more efficient.
Thanks for the presentation today.
I would like to understand how a sys admin would use sandboxes for the following scenarios:
Classic search on the web, e.g., e-commerce (see the Atlassian comment):
- I have a steady stream of small live queries (Web users) where I want to bound latency (mean < 100ms, P99 < 1000ms, Pmax < 2000ms).
- I mix in some longer-running, lower-priority queries which may use a lot of resources, but which can be re-run if they fail.
- I am also performing updates and I don’t want my search catalog to lag more than 10 sec behind the data feed.
Fairness in updating dashboards:
- I have 10 dashboards, where the user can specify update interval. If I can’t update them all within the update interval, I somehow need to fairly slow them all down – maybe by increasing the floor on the update interval on all of them. (But I also raise an alert.)
Mix of batch jobs:
- I run all batch jobs, and want them to run first-in, first-out. I assume that I’m running several in parallel at any one time, but when I need to abort one because others need more resources, I want to abort the most recent one.
Let me admit that – except for the first case – I have no practical experience with these scenarios. Are they realistic? Can we handle them?
-s
I echo @reta's thoughts here and had similar comments on labelling this a "sandbox", since this is more of a tenant-specific resource limit. While this is a good starting point, we need to see how we tie in the bigger picture with other multi-tenant capabilities like:
- Index level encryption
- Tenant specific shard placement for index patterns where users can choose to allocate indices on certain node groups for better query isolation
- Prioritised queue to run some queries in the background based on constraints.
Thanks @macrakis for providing these scenarios
- I have a steady stream of small live queries (Web users) where I want to bound latency (mean < 100ms, P99 < 1000ms, Pmax < 2000ms).
The 1st use case is more about end-to-end latency, which is affected by multiple factors and involves load characteristics across multiple nodes. With this feature we are introducing node-level resiliency, so this doesn't come under the purview of this RFC.
- I mix in some longer-running, lower-priority queries which may use a lot of resources, but which can be re-run if they fail.
This one hints more towards async completion of cancellable queries at the node level, which is something we can support in a 2nd iteration. Piggybacking too many things on this RFC will only clutter it and complicate its delivery.
- I am also performing updates and I don’t want my search catalog to lag more than 10 sec behind the data feed.
This is controlled using the index.refresh_interval setting, if I am right, so it is already present.
Thanks @Bukhtawar for taking the time to provide your insightful thoughts on this. Naming is something we can all come to an agreement on and use throughout.
Regarding tying this feature with other multi-tenant capabilities:
> Index level encryption
Not sure how this one correlates with node-level resiliency; can you be more specific?
> Tenant specific shard placement for index patterns where users can choose to allocate indices on certain node groups for better query isolation
Given that this is a node-level resiliency feature, and that the sharding mechanism already distributes data fairly well, tenant-specific placement could create skewness in data distribution. Since we will track and limit resource usage at the node level for search workloads, this will work fine as long as those query groups are not oversubscribing the resources. We can only isolate search queries when indexing traffic is not shared with search traffic.
> Prioritised queue to run some queries in the background based on constraints.
I suppose you are hinting towards async completion of cancellable queries, or do you mean something else? Since we are not introducing a priority concept for these query groups in the first iteration, we can take this up later.
Regarding a naming convention to use instead of Sandbox for this feature, what should we go with? A few initial suggestions:
- QueryGroupResourceBucket
- QuerySquadron
- QueryConvoy
Hi @kaushalmahi12, @jainankitk, @kiranprakash154 I just wanted to check in over here.
I know we've chatted a bit about this and I want to share that I am neither for/nor against making this or any of the related changes from a code perspective at the moment. I went ahead and left some basic comments on @kaushalmahi12's draft and don't have too many suggestions at this point.
It seems like there is some confusion around the use case for this feature; while I appreciate how detailed this RFC is (wow!), I don't know if you can point to some other issues or comments which ask for something like this?
Looking this over again, several of us have pointed out that "sandbox" is probably the wrong name. Besides implying incorrectly that it has something to do with testing or with security, it doesn't explicitly mention that it is about resources or groups of users. I like @reta's suggestion of "resource limits group".
Based on the latest discussion on this, we have concluded that this problem consists of the following independent subproblems:
- Rule-based matching/tagging of the incoming requests
- A sandbox need not have attributes defined, as the attributes only serve to tag incoming requests; hence it makes more sense to keep that entity separate. We will create two separate entities: one for enforcing the resource limits and one for defining the rules.
- Tracking and monitoring (this will include cancellations)
Based on the discussion with folks, we had decided to move the APIs to a plugin for the following reason:
- Not to pollute the core code with feature-specific APIs

We will make it a core plugin for the following reasons:
- The feature sits mostly in core, as the tracking and cancellation logic will reside there.
- The fundamental construct for this feature, QueryGroup, also resides in core because of the tracking and cancellation.
- It would be hard to correlate the feature if the changes were partially in core and partially in the plugin.
- We don't expect iterative development to happen on the plugin side.
@kaushalmahi12 Should it be a module? If the reasons to separate it from the core code are for modularity/architectural reasons, but the majority of the feature actually sits in the core, then it sounds like a module might be a better fit. There are two major differences between a module and a plugin:
- Modules use the same classloader of core, whereas each plugin gets its own classloader
- Modules are installed by default with no reasonable way for users to opt-out of them
This plugin appears to have no new dependencies, so I'm not sure the classloader difference is important. The other point is an important difference though. Are there cases where you would not want this feature to be installed?
@andrross These are important facts to consider when deciding whether to move the feature into a plugin or a module. I think it might not make sense for all users, since this feature will need enablement, so the code might sit dead in the artifact. If that is fine with the community, then it should be fine to move it to a module. What are your thoughts on this?
> code might sit dead in the artifact
Core plugins are in the min distribution artifact. They aren't loaded into the JVM unless installed, but the overhead of loading these classes is likely negligible.
> feature will need enablement
If the act of installing the plugin is the enablement mechanism, then a plugin is probably the right place, because otherwise you'd have to build some other enabling mechanism.
@msfroh What do you think regarding plugin vs module here?