bigquery-emulator
bigquery-emulator copied to clipboard
Inconsistent behavior using PERCENTILE_CONT function
Been trying to write some tests using the bigquery-emulator and so far so good! Until we ran into some strange behavior. Please see the cases from using the bq cli tool, but the results are the same querying via the bigquery API:
Working cases:
-
The zetasqlite/GCP documentation PERCENTILE_CONT test case
-
Simple 0-100 test case
bq --api http://0.0.0.0:9050 query --project_id=bq-test "WITH cte AS
(SELECT 0 as age UNION ALL SELECT 50 as age UNION ALL SELECT 100 as age)
SELECT
PERCENTILE_CONT(age, 0) OVER() min,
PERCENTILE_CONT(age, 0.01) OVER() percentile1,
PERCENTILE_CONT(age, 0.5) OVER() median,
PERCENTILE_CONT(age, 0.99) OVER() percentile99,
PERCENTILE_CONT(age, 1) OVER() max,
FROM cte LIMIT 1"
+-----+-------------+--------+--------------+-----+
| min | percentile1 | median | percentile99 | max |
+-----+-------------+--------+--------------+-----+
| 0 | 1 | 50 | 99 | 100 |
+-----+-------------+--------+--------------+-----+
Broken cases:
- Given 20, 30, and 40
Expected:
+-----+-------------+--------+--------------+-----+
| min | percentile1 | median | percentile99 | max |
+-----+-------------+--------+--------------+-----+
|20.0 | 20.2 | 30.0 | 39.8 | 40.0|
+-----+-------------+--------+--------------+-----+
Actual
bq --api http://0.0.0.0:9050 query --project_id=bq-test "WITH cte AS
(SELECT 20 as age UNION ALL SELECT 30 as age UNION ALL SELECT 40 as age)
SELECT
PERCENTILE_CONT(age, 0) OVER() min,
PERCENTILE_CONT(age, 0.01) OVER() percentile1,
PERCENTILE_CONT(age, 0.5) OVER() median,
PERCENTILE_CONT(age, 0.99) OVER() percentile99,
PERCENTILE_CONT(age, 1) OVER() max,
FROM cte LIMIT 1"
+-----+-------------+--------+--------------+-----+
| min | percentile1 | median | percentile99 | max |
+-----+-------------+--------+--------------+-----+
| 0 | 0.2 | 10 | 19.8 | 20 |
+-----+-------------+--------+--------------+-----+
- Given 500, 50, and 100
Expected:
+-----+-------------+--------+--------------+-----+
| min | percentile1 | median | percentile99 | max |
+-----+-------------+--------+--------------+-----+
|50.0 | 51.0 | 100.0 | 492.0 |500.0|
+-----+-------------+--------+--------------+-----+
Actual:
bq --api http://0.0.0.0:9050 query --project_id=bq-test "WITH cte AS
(SELECT 500 as age UNION ALL SELECT 50 as age UNION ALL SELECT 100 as age)
SELECT
PERCENTILE_CONT(age, 0) OVER() min,
PERCENTILE_CONT(age, 0.01) OVER() percentile1,
PERCENTILE_CONT(age, 0.5) OVER() median,
PERCENTILE_CONT(age, 0.99) OVER() percentile99,
PERCENTILE_CONT(age, 1) OVER() max,
FROM cte LIMIT 1"
+-----+-------------+--------+--------------+-----+
| min | percentile1 | median | percentile99 | max |
+-----+-------------+--------+--------------+-----+
| 0 | 4.5 | 225 | 445.5 | 450 |
+-----+-------------+--------+--------------+-----+
It appears the percentiles are being calculated using the range (max - min) multiplied by the percentile.
And the working cases conveniently work because they have a min value of 0?
@heyweswu can this be closed? Did your PR fix the issue?