[BUG] Circuit broken when too many concurrent PPL queries are run
Query Information
PPL Command/Query:
source=opensearch_dashboards_sample_data_flights
| where Carrier != 'Logstash Airways' AND timestamp > '2024-01-01'
| eval price_distance_ratio = AvgTicketPrice / DistanceKilometers, delay_factor = if(FlightDelayMin > 0, FlightDelayMin, 0), weather_score = if(DestWeather = 'Sunny', 5, if(DestWeather = 'Clear', 4, if(DestWeather = 'Partly Cloudy', 3, if(DestWeather = 'Cloudy', 2, 1))))
| stats sum(AvgTicketPrice) as total_revenue, avg(DistanceKilometers) as avg_distance, max(FlightDelayMin) as max_delay, avg(price_distance_ratio) as avg_price_per_km, sum(delay_factor) as total_delay_factor, avg(weather_score) as avg_weather_score by Carrier, OriginCountry, DestCountry
| eval efficiency_score = total_revenue / (avg_distance * (1 + total_delay_factor) * (6 - avg_weather_score))
| sort -efficiency_score, +avg_distance, -total_revenue
| eval rank = 1
| fields Carrier, OriginCountry, DestCountry, total_revenue, avg_distance, max_delay, avg_price_per_km, efficiency_score, rank
| sort -AvgTicketPrice, +DistanceKilometers, -DistanceMiles
| head 1000000
Expected Result: The above query, run on its own against the OSD sample flights data, succeeds. When 300 concurrent connections are made to the cluster, each issuing a few hundred of these requests, a more explicit error that does not trip a circuit breaker is expected.
Actual Result: The following error is received eventually:
Failed to execute phase [query], all shards failed; shardFailures {[cSDPaLn9RHOEMod0Qwh_JQ][opensearch_dashboards_sample_data_flights][0]: RemoteTransportException[[7167b47a0672ab1d56512d6e68d40c67][x.x.x.x:9300][indices:data/read/search[phase/query]]]; nested: QueryShardException[failed to create query: Failed to compile inline script [{\"langType\":\"v2\",\"script\":\"rO0ABXNyADRvcmcub3BlbnNlYXJjaC5zcWwuZXhwcmVzc2lvbi5mdW5jdGlvbi5GdW5jdGlvbkRTTCQzHWCy3iOeynUCAAVMAA12YWwkYXJndW1lbnRzdAAQTGphdmEvdXRpbC9MaXN0O0wADHZhbCRmdW5jdGlvbnQAQExvcmcvb3BlbnNlYXJjaC9zcWwvZXhwcmVzc2lvbi9mdW5jdGlvbi9...omitted for brevity...\"}] using lang [opensearch_compounded_script]]; nested: CircuitBreakingException[[script] Too many dynamic script compilations within, max: [75/5m]; please use indexed, or scripts with parameters instead; this limit can be changed by the [script.context.filter.max_compilations_rate] setting]; }; CircuitBreakingException[[script] Too many dynamic script compilations within, max: [75/5m]; please use indexed, or scripts with parameters instead; this limit can be changed by the [script.context.filter.max_compilations_rate] setting]"
It also trips a circuit breaker that takes down at least the PPL APIs for a few minutes.
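As a stopgap (not a fix for the underlying behavior), the compilation rate limit named in the error can be raised via the cluster settings API; the value below is illustrative only:

```
PUT _cluster/settings
{
  "persistent": {
    "script.context.filter.max_compilations_rate": "150/5m"
  }
}
```

This only delays the breaker under sustained load; the point of this report is that the plugin should not generate so many distinct inline scripts in the first place.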
Dataset Information
Dataset/Schema Type
- [ ] OpenTelemetry (OTEL)
- [ ] Simple Schema for Observability (SS4O)
- [ ] Open Cybersecurity Schema Framework (OCSF)
- [x] Custom (details below)
Index Mapping
{
"opensearch_dashboards_sample_data_flights": {
"mappings": {
"properties": {
"AvgTicketPrice": {
"type": "float"
},
"Cancelled": {
"type": "boolean"
},
"Carrier": {
"type": "keyword"
},
"Dest": {
"type": "keyword"
},
"DestAirportID": {
"type": "keyword"
},
"DestCityName": {
"type": "keyword"
},
"DestCountry": {
"type": "keyword"
},
"DestLocation": {
"type": "geo_point"
},
"DestRegion": {
"type": "keyword"
},
"DestWeather": {
"type": "keyword"
},
"DistanceKilometers": {
"type": "float"
},
"DistanceMiles": {
"type": "float"
},
"FlightDelay": {
"type": "boolean"
},
"FlightDelayMin": {
"type": "integer"
},
"FlightDelayType": {
"type": "keyword"
},
"FlightNum": {
"type": "keyword"
},
"FlightTimeHour": {
"type": "keyword"
},
"FlightTimeMin": {
"type": "float"
},
"Origin": {
"type": "keyword"
},
"OriginAirportID": {
"type": "keyword"
},
"OriginCityName": {
"type": "keyword"
},
"OriginCountry": {
"type": "keyword"
},
"OriginLocation": {
"type": "geo_point"
},
"OriginRegion": {
"type": "keyword"
},
"OriginWeather": {
"type": "keyword"
},
"dayOfWeek": {
"type": "integer"
},
"timestamp": {
"type": "date"
}
}
}
}
}
Sample Data
{
"FlightNum": "9HY9SWR",
"DestCountry": "AU",
"OriginWeather": "Sunny",
"OriginCityName": "Frankfurt am Main",
"AvgTicketPrice": 841.2656419677076,
"DistanceMiles": 10247.856675613455,
"FlightDelay": false,
"DestWeather": "Rain",
"Dest": "Sydney Kingsford Smith International Airport",
"FlightDelayType": "No Delay",
"OriginCountry": "DE",
"dayOfWeek": 0,
"DistanceKilometers": 16492.32665375846,
"timestamp": "2025-09-29T00:00:00",
"DestLocation": {
"lat": "-33.94609833",
"lon": "151.177002"
},
"DestAirportID": "SYD",
"Carrier": "OpenSearch Dashboards Airlines",
"Cancelled": false,
"FlightTimeMin": 1030.7704158599038,
"Origin": "Frankfurt am Main Airport",
"OriginLocation": {
"lat": "50.033333",
"lon": "8.570556"
},
"DestRegion": "SE-BD",
"OriginAirportID": "FRA",
"OriginRegion": "DE-HE",
"DestCityName": "Sydney",
"FlightTimeHour": 17.179506930998397,
"FlightDelayMin": 0
}
Bug Description
Issue Summary: Running a high volume of concurrent queries against the cluster should throw a more explicit validation or rate-limiting exception, rather than an error that causes an outage of the PPL plugin APIs.
Impact: The above error made at least the PPL plugin and its APIs unavailable for several minutes.
Environment Information
OpenSearch Version: OpenSearch 3.1
Hi @songkant-aws. Should this be resolved by https://github.com/opensearch-project/sql/pull/4615?
@toepkerd Could you provide your cluster settings related to the script service configuration? Did you run the same query across all 300 concurrent connections, or different queries?
@RyanL1997 This is a different problem. While #4615 addressed the issue of a single script being too large, the error message here suggests that too many scripts are being generated too quickly. It appears the script service has a configuration limit on the script compilation throughput.
@penghuo Any idea if OpenSearch can support better rate limiting?
In alerting use cases, users configure alerts to trigger at fixed intervals. This means the query itself remains constant, with only the timestamp filter changing over time. We should consider caching the compiled query to improve performance.
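To illustrate the caching principle suggested above (a hand-written sketch, not the plugin's actual generated script): OpenSearch caches compiled scripts keyed by their source, so keeping the changing value in `params` instead of interpolating it into `source` leaves the source byte-identical across requests, and it is compiled only once:

```json
"script": {
  "lang": "painless",
  "source": "doc['FlightDelayMin'].value > params.threshold",
  "params": { "threshold": 0 }
}
```

Only the `params` values vary between alert executions; the compilation cache entry is reused and the compilation rate limit is never approached.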
Does it mean these queries will have the pattern:
source=opensearch_dashboards_sample_data_flights
| where Carrier != 'Logstash Airways' AND timestamp > 'xxxx-xx-xx'
...
...
Actually, the above queries will produce DSL with only 3 static scripts in the aggregation builder and no script in the query builder:
{
"from": 0,
"size": 0,
"timeout": "1m",
"query": {
"bool": {
"must": [
{
"bool": {
"must": [
{
"exists": {
"field": "Carrier",
"boost": 1
}
}
],
"must_not": [
{
"term": {
"Carrier": {
"value": "Logstash Airways",
"boost": 1
}
}
}
],
"adjust_pure_negative": true,
"boost": 1
}
},
{
"range": {
"timestamp": {
"from": "2024-01-04T00:00:00.000Z",
"to": null,
"include_lower": false,
"include_upper": true,
"format": "date_time",
"boost": 1
}
}
}
],
"adjust_pure_negative": true,
"boost": 1
}
},
"_source": {
"includes": [
"AvgTicketPrice",
"Carrier",
"FlightDelayMin",
"timestamp",
"DistanceKilometers",
"DestWeather",
"OriginCountry",
"DestCountry"
],
"excludes": []
},
"aggregations": {
"composite_buckets": {
"composite": {
"size": 10000,
"sources": [
{
"Carrier": {
"terms": {
"field": "Carrier",
"missing_bucket": true,
"missing_order": "first",
"order": "asc"
}
}
},
{
"OriginCountry": {
"terms": {
"field": "OriginCountry",
"missing_bucket": true,
"missing_order": "first",
"order": "asc"
}
}
},
{
"DestCountry": {
"terms": {
"field": "DestCountry",
"missing_bucket": true,
"missing_order": "first",
"order": "asc"
}
}
}
]
},
"aggregations": {
"total_revenue": {
"sum": {
"field": "AvgTicketPrice"
}
},
"avg_distance": {
"avg": {
"field": "DistanceKilometers"
}
},
"max_delay": {
"max": {
"field": "FlightDelayMin"
}
},
"avg_price_per_km": {
"avg": {
"script": {
"source": "{\"langType\":\"calcite\",\"script\":\"...\"}",
"lang": "opensearch_compounded_script",
"params": {
"utcTimestamp": 1762410257420227000
}
}
}
},
"total_delay_factor": {
"sum": {
"script": {
"source": "{\"langType\":\"calcite\",\"script\":\"...\"}",
"lang": "opensearch_compounded_script",
"params": {
"utcTimestamp": 1762410257420227000
}
}
}
},
"avg_weather_score": {
"avg": {
"script": {
"source": "{\"langType\":\"calcite\",\"script\":\"...\"}",
"lang": "opensearch_compounded_script",
"params": {
"utcTimestamp": 1762410257420227000
}
}
}
}
}
}
}
}
Per my local test, if we submit queries in sequence that change only the timestamp filter, the scripts in the aggregation builder stay the same and hit the cache. @penghuo
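A way to verify this locally: the nodes stats API exposes script compilation counters, so comparing them before and after a burst of queries shows whether the cache is being hit:

```
GET _nodes/stats/script
```

In the response, a `compilations` count that stops growing while queries keep running indicates cache hits, while a rising `compilation_limit_triggered` count indicates the breaker seen in this issue.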
@penghuo Do you have queries from the alerting team? They should trigger script pushdown in the query builder, with the timestamp field included in the script.
This issue, with the specific query in the description, can only be reproduced on the v2 engine; the Calcite engine is always fine, even under high-concurrency query scenarios.
For the v2 engine, it can only push down the filter, not the aggregation after it. Its final plan produces a request with a filter script, and those filter scripts differ between requests even when the queries are identical, because the v2 engine encodes the current timestamp into the script, which always changes.
For the Calcite engine, it pushes down both the filter and the aggregation, but only produces scripts for the aggregation pushdown; see the DSL in https://github.com/opensearch-project/sql/issues/4584#issuecomment-3495298866. Those aggregation scripts are static, so different requests can share the cache even under high concurrency.
@toepkerd Could you explain how this issue is triggered? What is the OS version?
@qianheng-aws quick question, do we also optimize by parametrizing field names when something has to be applied on multiple fields?
We don't do that yet, but I'm writing this RFC to support some features, including parametrizing field names.