polars
RFC: Additional SQL functions
Problem description
Using this as a sort of backlog for polars-sql functionality.
This is by no means an exhaustive list of SQL functionality, but serves as an unprioritized checklist of functionality that can be added to polars-sql. Some of the functions below may be irrelevant to polars-sql.
Feel free to add/remove/suggest any other SQL functions that you feel are of importance to polars-sql.
Community requested
- [x] `CASE` `<EXPR>`
- [x] `COALESCE` `<EXPR>`
- [x] `DESCRIBE` `<STATEMENT>`
- [x] `DROP TABLE` `<STATEMENT>`
- [x] `EXPLAIN` `<STATEMENT>`
- [x] `IF` `<EXPR>`
- [x] `IFNULL` `<EXPR>`
- [ ] `IN (<subquery>)`
- [x] `NULLIF` `<EXPR>`
- [ ] `TRUNCATE TABLE` `<STATEMENT>`
Functions
Miscellaneous
- [x] `WITH ... AS` (CTEs)
- [x] `LEFT|RIGHT ANTI JOIN`
- [x] `LEFT|RIGHT SEMI JOIN`
Pattern Matching
- [x] `LIKE`
- [x] `ILIKE`
- [ ] `SIMILAR TO`
- [x] `RLIKE`
- [x] `REGEXP`
- [x] `~`: contains regex (case-sensitive)
- [x] `~*`: contains regex (case-insensitive)
- [x] `!~`: does not contain regex (case-sensitive)
- [x] `!~*`: does not contain regex (case-insensitive)
Null-aware comparison
- [x] `<=>` operator
- [x] `IS DISTINCT FROM`
- [x] `IS NOT DISTINCT FROM`
Mathematical
- [x] cbrt
- [x] degrees
- [ ] factorial
- [ ] gcd
- [ ] lcm
- [ ] min_scale
- [x] mod
- [x] pi
- [x] power
- [x] radians
- [x] round
- [ ] scale
- [x] sign
- [x] sqrt
- [ ] trim_scale
- [ ] trunc
- [ ] width_bucket
Random Functions
- [ ] random
- [ ] setseed
Trig Functions
- [x] acosd
- [x] asind
- [x] atan2
- [x] atan2d
- [x] cos
- [x] cosd
- [x] cot
- [x] cotd
- [x] sin
- [x] sind
- [x] tan
- [x] tand
String Functions
- [ ] ascii
- [x] bit_length
- [ ] btrim
- [x] char_length
- [ ] chr
- [x] concat
- [x] concat_ws
- [ ] format
- [x] initcap
- [x] left
- [x] length
- [ ] lpad
- [ ] md5
- [ ] normalize
- [x] octet_length
- [ ] overlay
- [ ] parse_ident
- [x] position
- [ ] quote_ident
- [ ] quote_literal
- [ ] quote_nullable
- [ ] regexp_count
- [ ] regexp_instr
- [x] regexp_like
- [ ] regexp_match
- [ ] regexp_matches
- [ ] regexp_replace
- [ ] regexp_split_to_array
- [ ] regexp_split_to_table
- [ ] regexp_substr
- [ ] repeat
- [x] replace
- [x] reverse
- [x] right
- [ ] rpad
- [ ] split_part
- [ ] string_to_array
- [ ] string_to_table
- [x] strpos
- [x] substr
- [ ] substring
- [ ] to_ascii
- [ ] to_hex
- [ ] translate
- [ ] unistr
Binary String Functions
- [ ] bit_count
- [ ] btrim
- [ ] get_bit
- [ ] get_byte
- [ ] ltrim
- [ ] md5
- [ ] rtrim
- [ ] set_bit
- [ ] set_byte
- [ ] sha224
- [ ] sha256
- [ ] sha384
- [ ] sha512
- [ ] substr
Binary String Conversion Functions
- [ ] convert
- [ ] convert_from
- [ ] convert_to
- [ ] decode
- [ ] encode
Date functions
- [ ] age
- [ ] clock_timestamp
- [ ] current_date
- [ ] current_time
- [ ] current_timestamp
- [ ] date_bin
- [x] date_part
- [ ] date_trunc
- [x] extract
- [ ] isfinite
- [ ] justify_days
- [ ] justify_hours
- [ ] localtime
- [ ] localtimestamp
- [ ] make_interval
- [ ] make_time
- [ ] make_timestamp
- [ ] make_timestamptz
- [ ] now
- [ ] statement_timestamp
- [ ] timeofday
- [ ] transaction_timestamp
- [ ] to_timestamp
Read Functions
- [ ] `read_json`
- [x] `read_ndjson`
- [x] `read_csv` with args
- [x] `read_parquet` with args
- [x] `read_ipc` with args
Write functions
- [ ] `COPY df TO 'path/to/file.csv'`
- [ ] `COPY df TO 'path/to/file.json'`
- [ ] `COPY df TO 'path/to/file.parquet'`
- [ ] `COPY df TO 'path/to/file.ipc'`
Hi @universalmind303,
When I think about SQL, I always think about CTEs and window functions (including specifying the window frame with RANGE/ROWS). Do you know if this is planned?
@lucazanna we do support window functions, e.g. `sum(x) OVER (PARTITION BY y)`. If there is anything that specifically doesn't work, please provide some SQL examples and we can look at adding it. CTEs are planned, and I updated the list to include them.
Thank you @universalmind303 for the CTEs!
For window functions, I often use the 'frame type' option: ROW frame or RANGE frame.
Below is a description I found for Spark SQL (https://www.databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html):

> `OVER (PARTITION BY ... ORDER BY ... frame_type BETWEEN start AND end)` Here, `frame_type` can be either `ROWS` (for ROW frame) or `RANGE` (for RANGE frame); `start` can be any of `UNBOUNDED PRECEDING`, `CURRENT ROW`, `PRECEDING`, and `FOLLOWING`; and `end` can be any of `UNBOUNDED FOLLOWING`, `CURRENT ROW`, `PRECEDING`, and `FOLLOWING`.

I would like to do something similar to the first answer on this SO question: https://stackoverflow.com/questions/30861919/what-is-rows-unbounded-preceding-used-for-in-teradata
And this is what I tried:

```python
df = pl.LazyFrame({
    't': [1, 2, 3, 4, 5, 6],
    'a': [1, 5, 3, 5, 4, 11]
})
sql = pl.SQLContext()
sql.register('df', df)
sql.query("""
SELECT t, a, sum(a) OVER (ORDER BY t ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
FROM df
ORDER BY t
""")
```

I get the error: `ComputeError: expected keys in groupby operation, got nothing`
I actually wanted to open an enhancement request regarding time functions. If I have a `pl.Datetime` column, how can I test time components? Something like `time_col.dt.hour < 10` in an SQL context. It seems like `extract`/`date_part` is required for that.
Yes please for CTEs! Main thing I feel is missing for SQL.
Not sure how long it would take to implement them, but would it be possible to have the RANK function and the EXTRACT function?
EXTRACT would make working with dates easier, and the RANK function is useful in several analytical queries.
@ritchie46, I've been thinking about #8425, and wanted to get your thoughts on the direction you'd like to see out of polars sql support. Maybe this is a bit premature, as we don't even fully support all of the lazyframe api yet. But I see it as one of 2 directions.
a lightweight & familiar way of expressing a logical plan via SQL (similar to datafusion)
This would entail exposing all functionality in the logical plan/lazyframe to SQL.
I think we are already well on track to this one. There's still a long way to go, but we're making good progress on increasing the SQL query functionality.
a full fledged database (similar to duckdb or sqlite)
So this would entail all of the above, but also:
- additional statements for managing the data (`information_schema`, `CREATE SCHEMA`, `CREATE DATABASE`, ...)
- ways of persisting dataframes (tables), in addition to our current support for registering lazyframes (external tables)
- support for some DBC (likely ADBC)

Given that 2 needs 1, I'd say let's look at 1 first and see what the world looks like once we are upon that hill? :)
Makes sense. I wasn't sure if there was a clear direction you wanted to head in. I guess it makes sense to do the first one before trying to implement ADBC or some of the other larger tasks associated with the latter.
I'd like to add the following to this list:
- describe (shows the schema)
- explain (shows the optimized plan)
- read_delta (reads a Delta dataset; since this is on the Python side, not sure if it will be possible)
Could we please have a new category for basic functions like the ones requested in #8901, #8889, and #8866?
What about adding priorities (1, 2, 3, etc.) to the functions? This would give users an idea of whether those functions are likely to be added earlier or later.
I've been mostly doing them when issues for specific feature requests come in. I updated the OP with a COMMUNITY REQUESTED section that I'll use for the highest-priority features. Feel free to suggest more either here or by opening individual feature requests. @lucazanna @csubhodeep
@universalmind303: that's a good idea. Can we add EXTRACT and RANK to the community requested section? I see that EXTRACT was also requested by @2-5 above.
and table aliasing :) When working with multiple tables joined together, that makes for more concise syntax
I agree with @lucazanna. I think we could have a separate category of window functions. And have prios within them.
We could take the following as references:
Where could we keep a priority list for bug fixes such as #8890?
Also @universalmind303 @ritchie46 @alexander-beedie could we please add #8918 to the Community requested section.
Also, would there be a possibility of having some kind of poll for the functions/features mentioned above? It may require moving this thread from Issues to Discussions (which is not enabled for this repo, I believe).
+100 on moving and breaking up this thread to multiple threads in GitHub Discussions
Polars SQL has `CREATE TABLE <TBLNAME> AS SELECT` support. It'd be great if it also supported the complementary `DROP TABLE` and `TRUNCATE TABLE` statements.
Additional date functions:
- justify_interval
- localtime(integer) - Current time of day, with limited precision
- localtimestamp(integer) - Current date and time, with limited precision
- make_date
In addition, the `OVERLAPS` operator as well. This would pretty much bring Polars SQL to parity with PostgreSQL 15's date/time capabilities.
Request for an additional SQL function:
- `ISNULL(expression, alt_value) => value` (edit: typo, ifnull to isnull)

It returns the result of the expression in every case, except when the resulting value is null; then it returns the alternative value, which can be either a default value or another expression. I've used it for error handling, to catch failed subqueries in a stored procedure and supply a value in place of letting the stored procedure fail.
I have two additional requests:
- more robust statistical functions like quantile, median, skew...
- the ability to register my own UDF to call from SQL, if that is possible
> More robust statistical functions like quantile, median, skew...

@brunorpinho: FYI, just added `MEDIAN` - now available in 0.20.19; will have a look at percentiles a bit later. I don't believe there is a standard PostgreSQL function for skewness, but I note that DuckDB has one, so I can see if any other DBs also have it and add it according to the most common syntax 🤔
> `ISNULL(expression, alt_value) => value`

@hennyboiszz: This is equivalent to `COALESCE(expression, alt_value)`, which we support 👌 (`ISNULL` is also not PostgreSQL syntax, which is the flavour of SQL that we most closely adhere to; `IFNULL` is also available though).
> It'd be great if it also supports the complementary `DROP TABLE` and `TRUNCATE TABLE` statements.

@jqnatividad: Done - `DROP TABLE` was added a little while ago, and `TRUNCATE TABLE` just landed in 0.20.19 ;)
Following up on @csubhodeep's original May 2023 request to add PostgreSQL's window functions to Polars SQL, I'm adding my vote to add them...
https://www.postgresql.org/docs/current/functions-window.html