polars icon indicating copy to clipboard operation
polars copied to clipboard

RFC: Additional SQL functions

Open universalmind303 opened this issue 1 year ago • 20 comments

Problem description

Using this as a sort of backlog for polars-sql functionality.

This is by no means an exhaustive list of sql functionality, but serves as an unprioritized checklist of functionality that can be added to polars sql. Some of the functions below may be irrelevant in polars-sql.

Feel free to add/remove/suggest any other SQL functions that you feel are of importance to polars-sql.

Community requested

  • [x] CASE <EXPR>
  • [x] COALESCE <EXPR>
  • [x] DESCRIBE <STATEMENT>
  • [x] DROP TABLE <STATEMENT>
  • [x] EXPLAIN <STATEMENT>
  • [x] IF <EXPR>
  • [x] IFNULL <EXPR>
  • [ ] IN (<subquery>)
  • [x] NULLIF <EXPR>
  • [ ] TRUNCATE TABLE <STATEMENT>

Functions

Miscellaneous

  • [x] WITH as (CTE's)
  • [x] LEFT|RIGHT ANTI JOIN
  • [x] LEFT|RIGHT SEMI JOIN

Pattern Matching

  • [x] LIKE
  • [x] ILIKE
  • [ ] SIMILAR TO
  • [x] RLIKE
  • [x] REGEXP
  • [x] ~: contains regex (case-sensitive)
  • [x] ~*: contains regex (case-insensitive)
  • [x] !~: does not contain regex (case-sensitive)
  • [x] !~*: does not contain regex (case-insensitive)

Null-aware comparison

  • [x] <=> operator
  • [x] IS DISTINCT FROM
  • [x] IS NOT DISTINCT FROM

Mathematical

  • [x] cbrt
  • [x] degrees
  • [ ] factorial
  • [ ] gcd
  • [ ] lcm
  • [ ] min_scale
  • [x] mod
  • [x] pi
  • [x] power
  • [x] radians
  • [x] round
  • [ ] scale
  • [x] sign
  • [x] sqrt
  • [ ] trim_scale
  • [ ] trunc
  • [ ] width_bucket

Random Functions

  • [ ] random
  • [ ] setseed

Trig Functions

  • [x] acosd
  • [x] asind
  • [x] atan2
  • [x] atan2d
  • [x] cos
  • [x] cosd
  • [x] cot
  • [x] cotd
  • [x] sin
  • [x] sind
  • [x] tan
  • [x] tand

String Functions

  • [ ] ascii
  • [x] bit_length
  • [ ] btrim
  • [x] char_length
  • [ ] chr
  • [x] concat
  • [x] concat_ws
  • [ ] format
  • [x] initcap
  • [x] left
  • [x] length
  • [ ] lpad
  • [ ] md5
  • [ ] normalize
  • [x] octet_length
  • [ ] overlay
  • [ ] parse_ident
  • [x] position
  • [ ] quote_ident
  • [ ] quote_literal
  • [ ] quote_nullable
  • [ ] regexp_count
  • [ ] regexp_instr
  • [x] regexp_like
  • [ ] regexp_match
  • [ ] regexp_matches
  • [ ] regexp_replace
  • [ ] regexp_split_to_array
  • [ ] regexp_split_to_table
  • [ ] regexp_substr
  • [ ] repeat
  • [x] replace
  • [x] reverse
  • [x] right
  • [ ] rpad
  • [ ] split_part
  • [ ] string_to_array
  • [ ] string_to_table
  • [x] strpos
  • [x] substr
  • [ ] substring
  • [ ] to_ascii
  • [ ] to_hex
  • [ ] translate
  • [ ] unistr

Binary String Functions

  • [ ] bit_count
  • [ ] btrim
  • [ ] get_bit
  • [ ] get_byte
  • [ ] ltrim
  • [ ] md5
  • [ ] rtrim
  • [ ] set_bit
  • [ ] set_byte
  • [ ] sha224
  • [ ] sha256
  • [ ] sha384
  • [ ] sha512
  • [ ] substr

Binary String Conversion Functions

  • [ ] convert
  • [ ] convert_from
  • [ ] convert_to
  • [ ] decode
  • [ ] encode

Date functions

  • [ ] age
  • [ ] clock_timestamp
  • [ ] current_date
  • [ ] current_time
  • [ ] current_timestamp
  • [ ] date_bin
  • [x] date_part
  • [ ] date_trunc
  • [x] extract
  • [ ] isfinite
  • [ ] justify_days
  • [ ] justify_hours
  • [ ] localtime
  • [ ] localtimestamp
  • [ ] make_interval
  • [ ] make_time
  • [ ] make_timestamp
  • [ ] make_timestamptz
  • [ ] now
  • [ ] statement_timestamp
  • [ ] timeofday
  • [ ] transaction_timestamp
  • [ ] to_timestamp

Read Functions

  • [ ] read_json
  • [x] read_ndjson
  • [x] read_csv with args
  • [x] read_parquet with args
  • [x] read_ipc with args

Write functions

  • [ ] COPY df to 'path/to/file.csv
  • [ ] COPY df to 'path/to/file.json
  • [ ] COPY df to 'path/to/file.parquet
  • [ ] COPY df to 'path/to/file.ipc

universalmind303 avatar Feb 27 '23 15:02 universalmind303

Hi @universalmind303 ,

When I think about SQL, I always think about CTEs and Window functions (including specifying the window frame with range/rows). Would you know if this is planned?

lucazanna avatar Feb 27 '23 16:02 lucazanna

@lucazanna we do support window functions sum(x) over (partition by y). If there is anything that specifically doesn't work, please provide some sql examples & we can look at adding. CTE's are planned & I updated the list to include CTEs

universalmind303 avatar Feb 27 '23 16:02 universalmind303

Thank you @universalmind303 for the CTE!

For the window function, I often use the 'frame type' option: row frame or range frame.

Below is a description I found for Spark SQL (https://www.databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html)

OVER (PARTITION BY ... ORDER BY ... frame_type BETWEEN start AND end) Here, frame_type can be either ROWS (for ROW frame) or RANGE (for RANGE frame); start can be any of UNBOUNDED PRECEDING, CURRENT ROW, PRECEDING, and FOLLOWING; and end can be any of UNBOUNDED FOLLOWING, CURRENT ROW, PRECEDING, and FOLLOWING.

I would like to do something similar to the first answer on this SO question: https://stackoverflow.com/questions/30861919/what-is-rows-unbounded-preceding-used-for-in-teradata

And this is what I tried:

df = pl.LazyFrame({
    't': [1,2,3,4,5,6],
    'a': [1,5,3,5,4,11]
})

sql = pl.SQLContext()
sql.register('df',df)

sql.query("""
SELECT t, a, sum(a) OVER (ORDER BY t ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
FROM df
ORDER BY t
""")

I have the error: ComputeError: expected keys in groupby operation, got nothing

lucazanna avatar Feb 27 '23 17:02 lucazanna

I actually wanted to open an Enhancement request regarding time functions.

If I have a pl.Datetime column, how can I test time components? Something like time_col.dt.hour < 10 in an SQL context. Seems like extract/date_part is required for that.

2-5 avatar Feb 28 '23 14:02 2-5

Yes please for CTEs! Main thing I feel is missing for SQL.

potter-potter avatar Apr 08 '23 17:04 potter-potter

not sure how long it would take to implement them, but would it be possible to have the RANK function and the EXTRACT function?

Extract will allow for easier working with dates, and rank functions is useful in several analytical queries

lucazanna avatar Apr 21 '23 16:04 lucazanna

@ritchie46, I've been thinking about #8425, and wanted to get your thoughts on the direction you'd like to see out of polars sql support. Maybe this is a bit premature, as we don't even fully support all of the lazyframe api yet. But I see it as one of 2 directions.

a lightweight & familiar way of expressing a logical plan via SQL (simliar to datafusion)

This would entail exposing all functionality in the logical plan/lazyframe to SQL.

I think we are already well on track to this one. There's still a long way to go, but we're making good progress on increasing the SQL query functionality.

a full fledged database (similar to duckdb or sqlite)

So this would entail all of the above, but also

  • additional statements for managing the data, (information_schema, create schema, create database, ...),
  • ways of persisting dataframes (tables). In addition to our current support for registering lazyframes (external tables)
  • support of some DBC, (likely ADBC)

universalmind303 avatar Apr 25 '23 02:04 universalmind303

Given that 2, needs 1. I'd say lets look at 1 first and see what the world looks like once we are upon that hill? :)

ritchie46 avatar Apr 25 '23 11:04 ritchie46

Makes sense, Wasn't sure if there was a clear direction you wanted to head in. I guess it makes sense to do the first one before trying to implement ADBC, or some of those other larger tasks associated with the latter.

universalmind303 avatar Apr 25 '23 14:04 universalmind303

I'd like to add the following to this list,

  • describe (shows schema)
  • explain (shows optimized plan)
  • read_delta (reads a delta dataset. since this is on python side not sure if it will be possible)

chitralverma avatar May 17 '23 10:05 chitralverma

Could we please have a new category for basic functions like the ones requested in #8901 #8889 #8866?

csubhodeep avatar May 17 '23 21:05 csubhodeep

What about adding priorities (1,2,3, etc) to the functions ? This can give users an idea if those functions are likely to be added earlier or later

lucazanna avatar May 18 '23 01:05 lucazanna

I've been mostly doing them when issues for specific feature requests come in. I updated the OP with a COMMUNITY REQUESTED section that I'll use as the highest priority features. feel free to suggest more either here, or by opening up individual feature requests. @lucazanna @csubhodeep

universalmind303 avatar May 18 '23 17:05 universalmind303

@universalmind303 : that's a good idea. can we add EXTRACT and RANK to the community requested? I see that EXTRACT was also requested by @2-5 above

and table aliasing :) When working with multiple tables joined together, that makes for more concise syntax

lucazanna avatar May 19 '23 03:05 lucazanna

I agree with @lucazanna. I think we could have a separate category of window functions. And have prios within them.

We could take the following as references:

  1. Window Functions

csubhodeep avatar May 19 '23 11:05 csubhodeep

Where could we keep a priority list for bug fixes such as #8890 .

Also @universalmind303 @ritchie46 @alexander-beedie could we please add #8918 to the Community requested section.

Also would there be a possibility to have some kind of polls for the functions/features mentioned above. It may require to move this thread from Issue to Discussions (which is not enabled for this repo I believe).

csubhodeep avatar Jun 08 '23 12:06 csubhodeep

+100 on moving and breaking up this thread to multiple threads in GitHub Discussions

jqnatividad avatar Jun 08 '23 19:06 jqnatividad

Polars SQL has CREATE TABLE <TBLNAME> AS SELECT support.

It'd be great if it also supports the complementary DROP TABLE and TRUNCATE TABLE statements.

jqnatividad avatar Jun 09 '23 19:06 jqnatividad

Additional date functions:

  • justify_interval
  • localtime(integer) - Current time of day, with limited precision
  • localtimestamp(integer) - Current date and time, with limited precision
  • make_date

In addition, the OVERLAPS operator as well .

This would pretty much make Polars SQL at parity with PostgreSQL 15's date/time capabilities

jqnatividad avatar Jun 11 '23 16:06 jqnatividad

request for additional SQL function

  • ISNULL(expression, alt_value) => value

(edit typo ifnull to isnull)

returns the result of the expression in every single case, except if the resulting value is a null value. If it is a null value then return the alternative value which can be either a default value or another expression.

used it for error handling, to catch if any subqueries failed in a stored proc and if so provide them a value put in place of causing a stored proc to fail.

hennyboiszz avatar Jun 21 '23 07:06 hennyboiszz

I have two additional requests:

  1. More robust statistical functions like quantile, median, skew...

  2. to be able to register my own UDF to call from SQL if that is possible

brunorpinho avatar Mar 11 '24 14:03 brunorpinho

  1. More robust statistical functions like quantile, median, skew...

@brunorpinho: FYI, just added MEDIAN - now available in 0.20.19; will have a look at percentiles a bit later. I don't believe there is a standard PostgreSQL function for skewness, but I note that DuckDB has once, so I can see if any other DBs also have it and add according to the most common syntax 🤔

alexander-beedie avatar Apr 08 '24 19:04 alexander-beedie

ISNULL(expression, alt_value) => value

@hennyboiszz: This is equivalent to COALESCE(expression, alt_value), which we support 👌 (ISNULL is also not PostgreSQL syntax, which is the flavour of SQL that we most closely adhere to; IFNULL is also available though).

alexander-beedie avatar Apr 08 '24 19:04 alexander-beedie

It'd be great if it also supports the complementary DROP TABLE and TRUNCATE TABLE statements.

@jqnatividad: Done - DROP TABLE was added a little while ago, and TRUNCATE TABLE just landed in 0.20.19 ;)

alexander-beedie avatar Apr 08 '24 19:04 alexander-beedie

Following up on @csubhodeep 's original May 2023 request to add PostgreSQL's window functions to Polars SQL, I'm adding my vote to add them...

https://www.postgresql.org/docs/current/functions-window.html

jqnatividad avatar Apr 19 '24 03:04 jqnatividad