[RFC] Support Percentile in PPL
Is your feature request related to a problem? This RFC is a part of this issue: https://github.com/opensearch-project/sql/issues/44 Currently, PPL can not answer question with percentile function. For example
"source=opensearch_dashboards_sample_data_ecommerce| stats percentile<50>(taxless_total_price)"
returns Unsupported aggregation function
"{\n \"error\": {\n \"reason\": \"Invalid Query\",\n \"details\": \"Unsupported aggregation function percentile\",\n \"type\": \"SemanticCheckException\"\n },\n \"status\": 400\n}"
SQL query with percentiles function works as below because that it fallbacks to legacy engine which is not support PPL, so does current JOIN syntax.
SELECT percentiles(taxful_total_price) FROM opensearch_dashboards_sample_data_ecommerce
PERCENTILE is a common aggregate function which has a lot of cases needs in Visualization with PPL. Same feature request from community: https://github.com/opendistro-for-elasticsearch/sql/issues/1093
What solution would you like?
Syntax in PPL
Option 1 (defined in current OpenSearchPPLParser.g4, but not implemented in code)
stats percentile<quantile>(aggField)
stats percentile_approx<quantile>(aggField)
Option 2 (more readable and widely used in OLAP engines)
stats percentile(aggField, quantile)
stats percentile_approx(aggField, quantile)
- quantile: The bounds of quantile must be in [0, 100]
Syntax in SQL
Basic
PERCENTILE(functionArg, quantile)
PERCENTILE_APPROX(functionArg, quantile)
ANSI SQL (experimental)
PERCENTILE_CONT(quantile) WITHIN GROUP [ORDER BY expression [ASC | DESC]]
PERCENTILE_DISC(quantile) WITHIN GROUP [ORDER BY expression [ASC | DESC]]
- Window function:
adding OVER ([PARTITION BY expression])
For example, mainstream database supports percentile_cont and percentile_disc
Postgresql
https://www.postgresql.org/docs/9.4/functions-aggregate.html
RedShift
https://docs.aws.amazon.com/redshift/latest/dg/r_PERCENTILE_CONT.html
Snowflake
https://docs.snowflake.com/en/sql-reference/functions/percentile_disc.html
Spark
https://issues.apache.org/jira/browse/SPARK-37691
Solution
To align with the current percentiles(aggField) implementation in legacy engine and percentiles agg in OpenSearch core, for percentile we will use t-digest construction algorithm which is an approximate calculation.
For percentile_cont and percentile_disc, we could use org.apache.commons.math3:Percentile instead. For example, setting EstimationType R_1 for percentile_disc and R_7 for percentile_cont. This two R_x quantile algorithms are very popular, which used in Spark, PostgreSQL and Excel, etc. Reference: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/quantile.html
What alternatives have you considered?
No, due to percentiles in legacy SQL engine couldn't work in PPL.
Do you have any additional context? Add any other context or screenshots about the feature request here.
PoC: https://github.com/LantaoJin/search-plugins-sql/tree/poc/percentile
cc @dai-chen, @penghuo
Option 2 (more readable and widely used in OLAP engines)
I perfer option2
To align with the current percentiles(aggField) implementation in legacy engine and percentiles agg in OpenSearch core, for percentile we will use t-digest construction algorithm which is an approximate calculation.
Do u proposal add another aggregator? Currently, SQL aggregation framework executed query plan on coordination node (if it can not be push down to OpenSearch). In future, we want to leverage Spark aggregation framework, instead of re-invent on it.
For percentile_cont and percentile_disc, we could use org.apache.commons.math3:Percentile instead. For example, setting EstimationType R_1 for percentile_disc and R_7 for percentile_cont. This two R_x quantile algorithms are very popular, which used in Spark, PostgreSQL and Excel, etc. Reference: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/quantile.html
Is it possible to implement it in OpenSearch Core? and SQL/PPL can leverage it.