bigquery-emulator icon indicating copy to clipboard operation
bigquery-emulator copied to clipboard

Inconsistent behavior using PERCENTILE_CONT function

Open heyweswu opened this issue 2 years ago • 2 comments

Been trying to write some tests using the bigquery-emulator and so far so good! Until we ran into some strange behavior. Please see the cases from using the bq cli tool, but the results are the same querying via the bigquery API:

Working cases:

bq --api http://0.0.0.0:9050 query --project_id=bq-test "WITH cte AS
    (SELECT 0 as age UNION ALL SELECT 50 as age UNION ALL SELECT 100 as age)
SELECT
    PERCENTILE_CONT(age, 0) OVER() min,
    PERCENTILE_CONT(age, 0.01) OVER() percentile1,
    PERCENTILE_CONT(age, 0.5) OVER() median,
    PERCENTILE_CONT(age, 0.99) OVER() percentile99,
    PERCENTILE_CONT(age, 1) OVER() max,
FROM cte LIMIT 1"
+-----+-------------+--------+--------------+-----+
| min | percentile1 | median | percentile99 | max |
+-----+-------------+--------+--------------+-----+
|   0 |           1 |     50 |           99 | 100 |
+-----+-------------+--------+--------------+-----+

Broken cases:

- Given 20, 30, and 40

Expected:

+-----+-------------+--------+--------------+-----+
| min | percentile1 | median | percentile99 | max |
+-----+-------------+--------+--------------+-----+
|20.0 | 20.2        | 30.0   | 39.8         | 40.0|
+-----+-------------+--------+--------------+-----+

Actual

bq --api http://0.0.0.0:9050 query --project_id=bq-test "WITH cte AS
    (SELECT 20 as age UNION ALL SELECT 30 as age UNION ALL SELECT 40 as age)
SELECT
    PERCENTILE_CONT(age, 0) OVER() min,
    PERCENTILE_CONT(age, 0.01) OVER() percentile1,
    PERCENTILE_CONT(age, 0.5) OVER() median,
    PERCENTILE_CONT(age, 0.99) OVER() percentile99,
    PERCENTILE_CONT(age, 1) OVER() max,
FROM cte LIMIT 1"
+-----+-------------+--------+--------------+-----+
| min | percentile1 | median | percentile99 | max |
+-----+-------------+--------+--------------+-----+
|   0 |         0.2 |     10 |         19.8 |  20 |
+-----+-------------+--------+--------------+-----+

- Given 500, 50, and 100

Expected:

+-----+-------------+--------+--------------+-----+
| min | percentile1 | median | percentile99 | max |
+-----+-------------+--------+--------------+-----+
|50.0 | 51.0        | 100.0  | 492.0        |500.0|
+-----+-------------+--------+--------------+-----+

Actual:

bq --api http://0.0.0.0:9050 query --project_id=bq-test "WITH cte AS
    (SELECT 500 as age UNION ALL SELECT 50 as age UNION ALL SELECT 100 as age)
SELECT
    PERCENTILE_CONT(age, 0) OVER() min,
    PERCENTILE_CONT(age, 0.01) OVER() percentile1,
    PERCENTILE_CONT(age, 0.5) OVER() median,
    PERCENTILE_CONT(age, 0.99) OVER() percentile99,
    PERCENTILE_CONT(age, 1) OVER() max,
FROM cte LIMIT 1"
+-----+-------------+--------+--------------+-----+
| min | percentile1 | median | percentile99 | max |
+-----+-------------+--------+--------------+-----+
|   0 |         4.5 |    225 |        445.5 | 450 |
+-----+-------------+--------+--------------+-----+

heyweswu avatar Jun 30 '23 20:06 heyweswu

It appears the percentiles are being calculated using the range (max - min) multiplied by the percentile.

And the working cases conveniently work because they have a min value of 0?

heyweswu avatar Jun 30 '23 21:06 heyweswu

@heyweswu can this be closed? Did your PR fix the issue?

ohaibbq avatar Apr 11 '24 23:04 ohaibbq