sql icon indicating copy to clipboard operation
sql copied to clipboard

[BUG] Circuit broken when too many concurrent PPL queries are run

Open toepkerd opened this issue 2 months ago • 11 comments

Query Information

PPL Command/Query:

source=opensearch_dashboards_sample_data_flights
| where Carrier != 'Logstash Airways' AND timestamp > '2024-01-01'
| eval price_distance_ratio = AvgTicketPrice / DistanceKilometers, delay_factor = if(FlightDelayMin > 0, FlightDelayMin, 0), weather_score = if(DestWeather = 'Sunny', 5, if(DestWeather = 'Clear', 4, if(DestWeather = 'Partly Cloudy', 3, if(DestWeather = 'Cloudy', 2, 1))))
| stats sum(AvgTicketPrice) as total_revenue, avg(DistanceKilometers) as avg_distance, max(FlightDelayMin) as max_delay, avg(price_distance_ratio) as avg_price_per_km, sum(delay_factor) as total_delay_factor, avg(weather_score) as avg_weather_score by Carrier, OriginCountry, DestCountry
| eval efficiency_score = total_revenue / (avg_distance * (1 + total_delay_factor) * (6 - avg_weather_score))
| sort -efficiency_score, +avg_distance, -total_revenue
| eval rank = 1
| fields Carrier, OriginCountry, DestCountry, total_revenue, avg_distance, max_delay, avg_price_per_km, efficiency_score, rank
| sort -AvgTicketPrice, +DistanceKilometers, -DistanceMiles
| head 1000000

Expected Result: The above query on its own, run against OSD sample flights data, succeeds. When 300 concurrent connections to the cluster were made, and a few hundred of these requests are made per connection, a more explicit error that doesn't trip a circuit breaker is expected.

Actual Result: The following error is received eventually:

Failed to execute phase [query], all shards failed; shardFailures {[cSDPaLn9RHOEMod0Qwh_JQ][opensearch_dashboards_sample_data_flights][0]: RemoteTransportException[[7167b47a0672ab1d56512d6e68d40c67][x.x.x.x:9300][indices:data/read/search[phase/query]]]; nested: QueryShardException[failed to create query: Failed to compile inline script [{\"langType\":\"v2\",\"script\":\"rO0ABXNyADRvcmcub3BlbnNlYXJjaC5zcWwuZXhwcmVzc2lvbi5mdW5jdGlvbi5GdW5jdGlvbkRTTCQzHWCy3iOeynUCAAVMAA12YWwkYXJndW1lbnRzdAAQTGphdmEvdXRpbC9MaXN0O0wADHZhbCRmdW5jdGlvbnQAQExvcmcvb3BlbnNlYXJjaC9zcWwvZXhwcmVzc2lvbi9mdW5jdGlvbi9...omitted for brevity...\"}] using lang [opensearch_compounded_script]]; nested: CircuitBreakingException[[script] Too many dynamic script compilations within, max: [75/5m]; please use indexed, or scripts with parameters instead; this limit can be changed by the [script.context.filter.max_compilations_rate] setting]; }; CircuitBreakingException[[script] Too many dynamic script compilations within, max: [75/5m]; please use indexed, or scripts with parameters instead; this limit can be changed by the [script.context.filter.max_compilations_rate] setting]"

It also trips a circuit breaker that downs at least the PPL APIs for a few minutes.

Dataset Information

Dataset/Schema Type

  • [ ] OpenTelemetry (OTEL)
  • [ ] Simple Schema for Observability (SS4O)
  • [ ] Open Cybersecurity Schema Framework (OCSF)
  • [x] Custom (details below)

Index Mapping

{
  "opensearch_dashboards_sample_data_flights": {
    "mappings": {
      "properties": {
        "AvgTicketPrice": {
          "type": "float"
        },
        "Cancelled": {
          "type": "boolean"
        },
        "Carrier": {
          "type": "keyword"
        },
        "Dest": {
          "type": "keyword"
        },
        "DestAirportID": {
          "type": "keyword"
        },
        "DestCityName": {
          "type": "keyword"
        },
        "DestCountry": {
          "type": "keyword"
        },
        "DestLocation": {
          "type": "geo_point"
        },
        "DestRegion": {
          "type": "keyword"
        },
        "DestWeather": {
          "type": "keyword"
        },
        "DistanceKilometers": {
          "type": "float"
        },
        "DistanceMiles": {
          "type": "float"
        },
        "FlightDelay": {
          "type": "boolean"
        },
        "FlightDelayMin": {
          "type": "integer"
        },
        "FlightDelayType": {
          "type": "keyword"
        },
        "FlightNum": {
          "type": "keyword"
        },
        "FlightTimeHour": {
          "type": "keyword"
        },
        "FlightTimeMin": {
          "type": "float"
        },
        "Origin": {
          "type": "keyword"
        },
        "OriginAirportID": {
          "type": "keyword"
        },
        "OriginCityName": {
          "type": "keyword"
        },
        "OriginCountry": {
          "type": "keyword"
        },
        "OriginLocation": {
          "type": "geo_point"
        },
        "OriginRegion": {
          "type": "keyword"
        },
        "OriginWeather": {
          "type": "keyword"
        },
        "dayOfWeek": {
          "type": "integer"
        },
        "timestamp": {
          "type": "date"
        }
      }
    }
  }
}

Sample Data

{
          "FlightNum": "9HY9SWR",
          "DestCountry": "AU",
          "OriginWeather": "Sunny",
          "OriginCityName": "Frankfurt am Main",
          "AvgTicketPrice": 841.2656419677076,
          "DistanceMiles": 10247.856675613455,
          "FlightDelay": false,
          "DestWeather": "Rain",
          "Dest": "Sydney Kingsford Smith International Airport",
          "FlightDelayType": "No Delay",
          "OriginCountry": "DE",
          "dayOfWeek": 0,
          "DistanceKilometers": 16492.32665375846,
          "timestamp": "2025-09-29T00:00:00",
          "DestLocation": {
            "lat": "-33.94609833",
            "lon": "151.177002"
          },
          "DestAirportID": "SYD",
          "Carrier": "OpenSearch Dashboards Airlines",
          "Cancelled": false,
          "FlightTimeMin": 1030.7704158599038,
          "Origin": "Frankfurt am Main Airport",
          "OriginLocation": {
            "lat": "50.033333",
            "lon": "8.570556"
          },
          "DestRegion": "SE-BD",
          "OriginAirportID": "FRA",
          "OriginRegion": "DE-HE",
          "DestCityName": "Sydney",
          "FlightTimeHour": 17.179506930998397,
          "FlightDelayMin": 0
        }

Bug Description

Issue Summary: Running a high volume of concurrent queries against the cluster should throw a more explicit validation or rate limiting exception that doesn't cause a hiccup in the PPL plugin APIs.

Impact: The above error led to unavailability of at least the PPL plugin and its APIs

Environment Information

OpenSearch Version: OpenSearch 3.1

toepkerd avatar Oct 16 '25 13:10 toepkerd

@penghuo Any idea if OpenSearch can support better rate limiting?

songkant-aws avatar Oct 20 '25 07:10 songkant-aws

Hi @songkant-aws. Should this be resolved by https://github.com/opensearch-project/sql/pull/4615?

RyanL1997 avatar Oct 21 '25 17:10 RyanL1997

@toepkerd Could you provide your cluster settings related to script service configuration? Do you run the same query with 300 concurrency or different queries?

songkant-aws avatar Oct 22 '25 02:10 songkant-aws

@RyanL1997 This is a different problem. While #4615 addressed the issue of a single script being too large, the error message here suggests that too many scripts are being generated too quickly. It appears the script service has a configuration limit on the script compilation throughput.

songkant-aws avatar Oct 22 '25 02:10 songkant-aws

@penghuo Any idea if OpenSearch can support better rate limiting?

In alerting use cases, users configure alerts to trigger at fixed intervals. This means the query itself remains constant, with only the timestamp filter changing over time. We should consider caching the compiled query to improve performance.

penghuo avatar Nov 05 '25 17:11 penghuo

In alerting use cases, users configure alerts to trigger at fixed intervals. This means the query itself remains constant, with only the timestamp filter changing over time. We should consider caching the compiled query to improve performance.

Does it mean these queries will have pattern:

source=opensearch_dashboards_sample_data_flights
| where Carrier != 'Logstash Airways' AND timestamp > 'xxxx-xx-xx'
...
...

Actually the above queries will produce DSL with only 3 static scripts in aggregation builder while no script in query builder:

{
    "from": 0,
    "size": 0,
    "timeout": "1m",
    "query": {
        "bool": {
            "must": [
                {
                    "bool": {
                        "must": [
                            {
                                "exists": {
                                    "field": "Carrier",
                                    "boost": 1
                                }
                            }
                        ],
                        "must_not": [
                            {
                                "term": {
                                    "Carrier": {
                                        "value": "Logstash Airways",
                                        "boost": 1
                                    }
                                }
                            }
                        ],
                        "adjust_pure_negative": true,
                        "boost": 1
                    }
                },
                {
                    "range": {
                        "timestamp": {
                            "from": "2024-01-04T00:00:00.000Z",
                            "to": null,
                            "include_lower": false,
                            "include_upper": true,
                            "format": "date_time",
                            "boost": 1
                        }
                    }
                }
            ],
            "adjust_pure_negative": true,
            "boost": 1
        }
    },
    "_source": {
        "includes": [
            "AvgTicketPrice",
            "Carrier",
            "FlightDelayMin",
            "timestamp",
            "DistanceKilometers",
            "DestWeather",
            "OriginCountry",
            "DestCountry"
        ],
        "excludes": []
    },
    "aggregations": {
        "composite_buckets": {
            "composite": {
                "size": 10000,
                "sources": [
                    {
                        "Carrier": {
                            "terms": {
                                "field": "Carrier",
                                "missing_bucket": true,
                                "missing_order": "first",
                                "order": "asc"
                            }
                        }
                    },
                    {
                        "OriginCountry": {
                            "terms": {
                                "field": "OriginCountry",
                                "missing_bucket": true,
                                "missing_order": "first",
                                "order": "asc"
                            }
                        }
                    },
                    {
                        "DestCountry": {
                            "terms": {
                                "field": "DestCountry",
                                "missing_bucket": true,
                                "missing_order": "first",
                                "order": "asc"
                            }
                        }
                    }
                ]
            },
            "aggregations": {
                "total_revenue": {
                    "sum": {
                        "field": "AvgTicketPrice"
                    }
                },
                "avg_distance": {
                    "avg": {
                        "field": "DistanceKilometers"
                    }
                },
                "max_delay": {
                    "max": {
                        "field": "FlightDelayMin"
                    }
                },
                "avg_price_per_km": {
                    "avg": {
                        "script": {
                            "source": "{\"langType\":\"calcite\",\"script\":\"..."}",
                            "lang": "opensearch_compounded_script",
                            "params": {
                                "utcTimestamp": 1762410257420227000
                            }
                        }
                    }
                },
                "total_delay_factor": {
                    "sum": {
                        "script": {
                            "source": "{\"langType\":\"calcite\",\"script\":\"...."}",
                            "lang": "opensearch_compounded_script",
                            "params": {
                                "utcTimestamp": 1762410257420227000
                            }
                        }
                    }
                },
                "avg_weather_score": {
                    "avg": {
                        "script": {
                            "source": "{\"langType\":\"calcite\",\"script\":\"..."}",
                            "lang": "opensearch_compounded_script",
                            "params": {
                                "utcTimestamp": 1762410257420227000
                            }
                        }
                    }
                }
            }
        }
    }
}

And per my local test, if we submit queries in sequence with only change on their timestamp filter, the scripts in aggregation builder will be the same and hit the cache. @penghuo

qianheng-aws avatar Nov 06 '25 06:11 qianheng-aws

@penghuo Do you have queries from alert team? It should trigger script push down in query builder and including the timestamp field in the script.

qianheng-aws avatar Nov 06 '25 06:11 qianheng-aws

This issue, with the specific query in description, can only be reproduced on v2 engine, while always fine on calcite engine even under high concurrency query scenario.

For v2 engine, it can only push down the filter but not the aggregation after that. Its final plan produces request with filter script , and filter scripts vary among different query requests although the queries are the same. It's because v2 engine encodes the current timestamp into its script which will always change.

For calcite engine, it pushes both filter and aggregation, while only produces script for aggregation push down, see the DSL in https://github.com/opensearch-project/sql/issues/4584#issuecomment-3495298866. And the scripts for aggregation are static and so different requests can share the cache, even in high concurrency scenario.

qianheng-aws avatar Nov 07 '25 07:11 qianheng-aws

@toepkerd Could u explain how does this issue triggered? What is OS version?

penghuo avatar Nov 07 '25 16:11 penghuo

@qianheng-aws quick question, do we also optimize by parametrizing field names when something has to be applied on multiple fields?

vamsimanohar avatar Nov 07 '25 20:11 vamsimanohar

@qianheng-aws quick question, do we also optimize by parametrizing field names when something has to be applied on multiple fields?

we don't do that yet but I'm writing this RFC to support some features including parametrizing field names.

qianheng-aws avatar Nov 10 '25 02:11 qianheng-aws