polars
RFC: Additional SQL functions
Problem description
Using this as a sort of backlog for polars-sql functionality.
This is by no means an exhaustive list of SQL functionality, but serves as an unprioritized checklist of functionality that can be added to polars-sql. Some of the functions below may be irrelevant to polars-sql.
Feel free to add/remove/suggest any other SQL functions that you feel are of importance to polars-sql.
Community requested
- [x] `CASE` `<EXPR>`
- [x] `COALESCE` `<EXPR>`
- [x] `DESCRIBE` `<STATEMENT>`
- [x] `DROP TABLE` `<STATEMENT>`
- [x] `EXPLAIN` `<STATEMENT>`
- [x] `IF` `<EXPR>`
- [x] `IFNULL` `<EXPR>`
- [ ] `IN (<subquery>)`
- [x] `NULLIF` `<EXPR>`
- [ ] `TRUNCATE TABLE` `<STATEMENT>`
Functions
Miscellaneous
- [x] `WITH ... AS` (CTEs)
- [x] `LEFT|RIGHT ANTI JOIN`
- [x] `LEFT|RIGHT SEMI JOIN`
Pattern Matching
- [x] `LIKE`
- [x] `ILIKE`
- [ ] `SIMILAR TO`
- [x] `RLIKE`
- [x] `REGEXP`
- [x] `~`: contains regex (case-sensitive)
- [x] `~*`: contains regex (case-insensitive)
- [x] `!~`: does not contain regex (case-sensitive)
- [x] `!~*`: does not contain regex (case-insensitive)
Null-aware comparison
- [x] `<=>` operator
- [x] `IS DISTINCT FROM`
- [x] `IS NOT DISTINCT FROM`
Mathematical
- [x] cbrt
- [x] degrees
- [ ] factorial
- [ ] gcd
- [ ] lcm
- [ ] min_scale
- [x] mod
- [x] pi
- [x] power
- [x] radians
- [x] round
- [ ] scale
- [x] sign
- [x] sqrt
- [ ] trim_scale
- [ ] trunc
- [ ] width_bucket
Random Functions
- [ ] random
- [ ] setseed
Trig Functions
- [x] acosd
- [x] asind
- [x] atan2
- [x] atan2d
- [x] cos
- [x] cosd
- [x] cot
- [x] cotd
- [x] sin
- [x] sind
- [x] tan
- [x] tand
String Functions
- [ ] ascii
- [x] bit_length
- [ ] btrim
- [x] char_length
- [ ] chr
- [x] concat
- [x] concat_ws
- [ ] format
- [x] initcap
- [x] left
- [x] length
- [ ] lpad
- [ ] md5
- [ ] normalize
- [x] octet_length
- [ ] overlay
- [ ] parse_ident
- [x] position
- [ ] quote_ident
- [ ] quote_literal
- [ ] quote_nullable
- [ ] regexp_count
- [ ] regexp_instr
- [x] regexp_like
- [ ] regexp_match
- [ ] regexp_matches
- [ ] regexp_replace
- [ ] regexp_split_to_array
- [ ] regexp_split_to_table
- [ ] regexp_substr
- [ ] repeat
- [x] replace
- [x] reverse
- [x] right
- [ ] rpad
- [ ] split_part
- [ ] string_to_array
- [ ] string_to_table
- [x] strpos
- [x] substr
- [ ] substring
- [ ] to_ascii
- [ ] to_hex
- [ ] translate
- [ ] unistr
Binary String Functions
- [ ] bit_count
- [ ] btrim
- [ ] get_bit
- [ ] get_byte
- [ ] ltrim
- [ ] md5
- [ ] rtrim
- [ ] set_bit
- [ ] set_byte
- [ ] sha224
- [ ] sha256
- [ ] sha384
- [ ] sha512
- [ ] substr
Binary String Conversion Functions
- [ ] convert
- [ ] convert_from
- [ ] convert_to
- [ ] decode
- [ ] encode
Date functions
- [ ] age
- [ ] clock_timestamp
- [ ] current_date
- [ ] current_time
- [ ] current_timestamp
- [ ] date_bin
- [x] date_part
- [ ] date_trunc
- [x] extract
- [ ] isfinite
- [ ] justify_days
- [ ] justify_hours
- [ ] localtime
- [ ] localtimestamp
- [ ] make_interval
- [ ] make_time
- [ ] make_timestamp
- [ ] make_timestamptz
- [ ] now
- [ ] statement_timestamp
- [ ] timeofday
- [ ] transaction_timestamp
- [ ] to_timestamp
Read Functions
- [ ] `read_json`
- [x] `read_ndjson`
- [x] `read_csv` with args
- [x] `read_parquet` with args
- [x] `read_ipc` with args
Write functions
- [ ] `COPY df TO 'path/to/file.csv'`
- [ ] `COPY df TO 'path/to/file.json'`
- [ ] `COPY df TO 'path/to/file.parquet'`
- [ ] `COPY df TO 'path/to/file.ipc'`
Hi @universalmind303,
When I think about SQL, I always think about CTEs and window functions (including specifying the window frame with RANGE/ROWS). Do you know if this is planned?
@lucazanna we do support window functions, e.g. `sum(x) OVER (PARTITION BY y)`. If there is anything that specifically doesn't work, please provide some SQL examples and we can look at adding it. CTEs are planned, and I updated the list to include them.
Thank you @universalmind303 for the CTEs!
For window functions, I often use the 'frame type' option: ROW frame or RANGE frame.
Below is a description I found for Spark SQL (https://www.databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html):

> `OVER (PARTITION BY ... ORDER BY ... frame_type BETWEEN start AND end)` Here, `frame_type` can be either `ROWS` (for ROW frame) or `RANGE` (for RANGE frame); `start` can be any of `UNBOUNDED PRECEDING`, `CURRENT ROW`, `PRECEDING`, and `FOLLOWING`; and `end` can be any of `UNBOUNDED FOLLOWING`, `CURRENT ROW`, `PRECEDING`, and `FOLLOWING`.

I would like to do something similar to the first answer on this SO question: https://stackoverflow.com/questions/30861919/what-is-rows-unbounded-preceding-used-for-in-teradata
And this is what I tried:

```python
df = pl.LazyFrame({
    't': [1, 2, 3, 4, 5, 6],
    'a': [1, 5, 3, 5, 4, 11]
})
sql = pl.SQLContext()
sql.register('df', df)
sql.query("""
SELECT t, a, sum(a) OVER (ORDER BY t ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
FROM df
ORDER BY t
""")
```

I get the error: `ComputeError: expected keys in groupby operation, got nothing`
I actually wanted to open an enhancement request regarding time functions. If I have a `pl.Datetime` column, how can I test time components? Something like `time_col.dt.hour < 10` in an SQL context. It seems like `extract`/`date_part` is required for that.
Yes please for CTEs! Main thing I feel is missing for SQL.
Not sure how long it would take to implement them, but would it be possible to have the RANK function and the EXTRACT function?
EXTRACT would make working with dates easier, and the RANK function is useful in several analytical queries.
@ritchie46, I've been thinking about #8425, and wanted to get your thoughts on the direction you'd like to see out of polars sql support. Maybe this is a bit premature, as we don't even fully support all of the lazyframe api yet. But I see it as one of 2 directions.
a lightweight & familiar way of expressing a logical plan via SQL (similar to datafusion)
This would entail exposing all functionality in the logical plan/lazyframe to SQL.
I think we are already well on track to this one. There's still a long way to go, but we're making good progress on increasing the SQL query functionality.
a full fledged database (similar to duckdb or sqlite)
So this would entail all of the above, but also:
- additional statements for managing the data (`information_schema`, `CREATE SCHEMA`, `CREATE DATABASE`, ...)
- ways of persisting dataframes (tables), in addition to our current support for registering lazyframes (external tables)
- support for some DBC (likely ADBC)

Given that 2 needs 1, I'd say let's look at 1 first and see what the world looks like once we are upon that hill? :)
Makes sense. I wasn't sure if there was a clear direction you wanted to head in. I guess it makes sense to do the first one before trying to implement ADBC or some of the other larger tasks associated with the latter.
I'd like to add the following to this list:
- describe (shows the schema)
- explain (shows the optimized plan)
- read_delta (reads a Delta dataset; since this is on the Python side, not sure if it will be possible)
Could we please have a new category for basic functions like the ones requested in #8901, #8889, and #8866?
What about adding priorities (1, 2, 3, etc.) to the functions? This would give users an idea of whether those functions are likely to be added earlier or later.
I've been mostly doing them when issues for specific feature requests come in. I updated the OP with a COMMUNITY REQUESTED section that I'll use for the highest-priority features. Feel free to suggest more either here or by opening individual feature requests. @lucazanna @csubhodeep
@universalmind303: that's a good idea. Can we add EXTRACT and RANK to the community requested section? I see that EXTRACT was also requested by @2-5 above.
and table aliasing :) When working with multiple tables joined together, that makes for more concise syntax
I agree with @lucazanna. I think we could have a separate category of window functions. And have prios within them.
We could take the following as references:
Where could we keep a priority list for bug fixes such as #8890?
Also @universalmind303 @ritchie46 @alexander-beedie could we please add #8918 to the Community requested section.
Also, would there be a possibility of having some kind of poll for the functions/features mentioned above? It may require moving this thread from Issues to Discussions (which is not enabled for this repo, I believe).
+100 on moving and breaking up this thread to multiple threads in GitHub Discussions
Polars SQL has `CREATE TABLE <TBLNAME> AS SELECT` support. It'd be great if it also supported the complementary `DROP TABLE` and `TRUNCATE TABLE` statements.
Additional date functions:
- justify_interval
- localtime(integer) - Current time of day, with limited precision
- localtimestamp(integer) - Current date and time, with limited precision
- make_date
In addition, the `OVERLAPS` operator as well. This would pretty much bring Polars SQL to parity with PostgreSQL 15's date/time capabilities.
Request for an additional SQL function:
- `ISNULL(expression, alt_value) => value` (edit: typo, ifnull to isnull)

It returns the result of the expression in every case, except when the resulting value is null; then it returns the alternative value, which can be either a default value or another expression. I've used it for error handling, to catch failed subqueries in a stored procedure and supply a value in place of letting the stored procedure fail.
I have two additional requests:
- more robust statistical functions like quantile, median, skew...
- the ability to register my own UDF to call from SQL, if that is possible
> More robust statistical functions like quantile, median, skew...

@brunorpinho: FYI, just added `MEDIAN` - now available in 0.20.19; will have a look at percentiles a bit later. I don't believe there is a standard PostgreSQL function for skewness, but I note that DuckDB has one, so I can see if any other DBs also have it and add it according to the most common syntax 🤔
> `ISNULL(expression, alt_value) => value`

@hennyboiszz: This is equivalent to `COALESCE(expression, alt_value)`, which we support 👌 (`ISNULL` is also not PostgreSQL syntax, which is the flavour of SQL that we most closely adhere to; `IFNULL` is also available though).
> It'd be great if it also supports the complementary `DROP TABLE` and `TRUNCATE TABLE` statements.

@jqnatividad: Done - `DROP TABLE` was added a little while ago, and `TRUNCATE TABLE` just landed in 0.20.19 ;)
Following up on @csubhodeep's original May 2023 request to add PostgreSQL's window functions to Polars SQL, I'm adding my vote to add them...
https://www.postgresql.org/docs/current/functions-window.html