dask-sql
dask-sql copied to clipboard
[ENH] Add percentile function(s)
I'd like to compute median using percentiles of a column (this is also supported in Spark SQL):
>>> import pandas as pd
>>> from dask_sql import Context
>>> c = Context()
>>> df = pd.DataFrame({'thing': ['a', 'b', 'c'], 'value': [0, 1, 2]})
>>> c.create_table('things', df)
>>> c.sql("select percentile(value, 0.5) from things")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/rgelhausen/conda/envs/dsql-12-09/lib/python3.8/site-packages/dask_sql/context.py", line 458, in sql
rel, select_names, _ = self._get_ral(sql)
File "/home/rgelhausen/conda/envs/dsql-12-09/lib/python3.8/site-packages/dask_sql/context.py", line 868, in _get_ral
raise ParsingException(sql, str(e.message())) from None
dask_sql.utils.ParsingException: Can not parse the given SQL: Encountered "( value" at line 1, column 18.
Was expecting one of:
"." ...
"(" "*" ...
"(" ")" ...
"(" "DISTINCT" ...
"(" "ALL" ...
...
The problem is probably somewhere here:
select percentile(value, 0.5) from things
I also tried Postgres's percentile_disc:
>>> c.sql("""
... select
... percentile_disc(0.25) within group (order by things.value),
... percentile_disc(0.5) within group (order by things.value),
... percentile_disc(0.75) within group (order by things.value)
... from things
... """)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/rgelhausen/conda/envs/dsql-12-09/lib/python3.8/site-packages/dask_sql/context.py", line 458, in sql
rel, select_names, _ = self._get_ral(sql)
File "/home/rgelhausen/conda/envs/dsql-12-09/lib/python3.8/site-packages/dask_sql/context.py", line 868, in _get_ral
raise ParsingException(sql, str(e.message())) from None
dask_sql.utils.ParsingException: Can not parse the given SQL: Encountered "percentile_disc (" at line 3, column 3.
Was expecting one of:
"/*+" ...
"*" ...
"+" ...
"-" ...
"NOT" ...
...
The problem is probably somewhere here:
select
percentile_disc(0.25) within group (order by things.value),
^
percentile_disc(0.5) within group (order by things.value),
percentile_disc(0.75) within group (order by things.value)
from things
Interesting. I can take a look at this
https://github.com/rapidsai/cudf/issues/4706 probably a blocker
The percentile(col)
use case may not be blocked by https://github.com/rapidsai/cudf/issues/4706 , as the top-level DataFrame has a quantile implementation and delegates appropriately.
import dask
import dask_cudf
ddf = dask.datasets.timeseries(
start='2000-01-01',
end='2000-01-31',
freq='10s',
)
gddf = dask_cudf.from_dask_dataframe(ddf)
print(gddf.x.quantile([0.5]).compute())
print(gddf.x.quantile(0.5).compute())
0.5 0.019074
Name: x, dtype: float64
0.019073669821518924
For group / window percentiles (approx or otherwise), we'd likely want to resolve the linked issue and implement this in Dask
Happy to look into this a bit more