dask-sql icon indicating copy to clipboard operation
dask-sql copied to clipboard

[ENH] Add percentile function(s)

Open randerzander opened this issue 3 years ago • 4 comments

I'd like to compute median using percentiles of a column (this is also supported in Spark SQL):

>>> import pandas as pd
>>> from dask_sql import Context
>>> c = Context()
>>> df = pd.DataFrame({'thing': ['a', 'b', 'c'], 'value': [0, 1, 2]})
>>> c.create_table('things', df)
>>> c.sql("select percentile(value, 0.5) from things")                              
Traceback (most recent call last):                                                 
  File "<stdin>", line 1, in <module>
  File "/home/rgelhausen/conda/envs/dsql-12-09/lib/python3.8/site-packages/dask_sql/context.py", line 458, in sql
    rel, select_names, _ = self._get_ral(sql)
  File "/home/rgelhausen/conda/envs/dsql-12-09/lib/python3.8/site-packages/dask_sql/context.py", line 868, in _get_ral
    raise ParsingException(sql, str(e.message())) from None
dask_sql.utils.ParsingException: Can not parse the given SQL: Encountered "( value" at line 1, column 18.
Was expecting one of:              
    "." ...            
    "(" "*" ...      
    "(" ")" ...         
    "(" "DISTINCT" ...
    "(" "ALL" ...       
...
The problem is probably somewhere here:

        select percentile(value, 0.5) from things

I also tried Postgres's percentile_disc:

>>> c.sql("""       
... select               
...   percentile_disc(0.25) within group (order by things.value),                  
...   percentile_disc(0.5) within group (order by things.value),
...   percentile_disc(0.75) within group (order by things.value)
... from things         
... """)  
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/rgelhausen/conda/envs/dsql-12-09/lib/python3.8/site-packages/dask_sql/context.py", line 458, in sql
    rel, select_names, _ = self._get_ral(sql)
  File "/home/rgelhausen/conda/envs/dsql-12-09/lib/python3.8/site-packages/dask_sql/context.py", line 868, in _get_ral
    raise ParsingException(sql, str(e.message())) from None
dask_sql.utils.ParsingException: Can not parse the given SQL: Encountered "percentile_disc (" at line 3, column 3.
Was expecting one of:
    "/*+" ...           
    "*" ...              
    "+" ...             
    "-" ...          
    "NOT" ...          
...
The problem is probably somewhere here:


        select
          percentile_disc(0.25) within group (order by things.value),
          ^
          percentile_disc(0.5) within group (order by things.value),
          percentile_disc(0.75) within group (order by things.value)
        from things

randerzander avatar Dec 09 '21 19:12 randerzander

Interesting. I can take a look at this

jdye64 avatar Dec 17 '21 16:12 jdye64

https://github.com/rapidsai/cudf/issues/4706 probably a blocker

randerzander avatar Dec 20 '21 18:12 randerzander

The percentile(col) use case may not be blocked by https://github.com/rapidsai/cudf/issues/4706 , as the top-level DataFrame has a quantile implementation and delegates appropriately.

import dask
import dask_cudf
​
ddf = dask.datasets.timeseries(
    start='2000-01-01',
    end='2000-01-31',
    freq='10s',
)
gddf = dask_cudf.from_dask_dataframe(ddf)
print(gddf.x.quantile([0.5]).compute())
print(gddf.x.quantile(0.5).compute())
0.5    0.019074
Name: x, dtype: float64
0.019073669821518924

For group / window percentiles (approx or otherwise), we'd likely want to resolve the linked issue and implement this in Dask

beckernick avatar Feb 04 '22 19:02 beckernick

Happy to look into this a bit more

ayushdg avatar Feb 15 '22 19:02 ayushdg