
Executing an SQL query over a dataset

Open calavia88 opened this issue 2 years ago • 8 comments

I would like to execute SQL queries over a dataframe or lazyframe

We would expect to run queries like: df.query("select * from df")

calavia88 avatar Mar 10 '22 15:03 calavia88

Hi, I'll just chime in here with a question on how we should approach this problem.

It could be as simple as the above example, or as complex as what DataFusion has already achieved.

What do you think @ritchie46 @ghuls? Should we go as simple as adding SQL support for simple queries (SELECT, GROUP BY) on a single dataframe, similar to pandas' query? Or should we have a context, register dataframes with that context, and execute queries on it (similar to Spark and DataFusion), which could then support JOIN, UNION, and many other complex operations? For example:

ctx = pl.sql.Context()
ctx.register(df1, 'df1')
ctx.register(df2, 'df2')
df3 = ctx.execute('... SOME SQL QUERY ')

potter420 avatar Mar 23 '22 18:03 potter420

Hi,

What I wanted is to create something more complex, as in your example. My example was too simple in the end.

calavia88 avatar Mar 23 '22 18:03 calavia88

@potter420, I think we should add a context if we decide to do this, where the pandas-like behavior would be a subset of that context.

There has already been a very very simple start on this here: #2880

Anybody willing to improve/extend upon that is more than welcome. It is something I'd like to have in this project, but currently don't have the bandwidth to do on my own. There is already a Postgres-compliant SQL parser in Rust, so we need to make the mapping from that AST to a polars logical plan.

This should also open up the possibility of a polars SQL CLI that queries different files.

ritchie46 avatar Mar 23 '22 19:03 ritchie46

Alright, complex it is then. I'll lay out a to-do list here, please correct me :D

  • [ ] Context Object:

    • [ ] HashMap<String, DataFrame> that maps a name to a DataFrame.
    • [ ] execute
    • [ ] register
  • [ ] Query Parser: basically translate an SQL query to a polars lazy expression, which will support:

    • [ ] SELECT
    • [ ] GROUP BY
    • [ ] JOIN
    • [ ] UNION/EXCEPT/INTERSECT
    • [ ] WINDOW AGG
    • [ ] SQL functions to polars: a static map here; will consult the Postgres documentation as a reference
    • [ ] CREATE EXTERNAL TABLE: this will be used to read files, looks the same as Impala, check this link. "External" in the sense that the DataFrame was not explicitly declared in the Context
    • [ ] CREATE TABLE AS ...: query and then save the result into the context for further processing.
    • [ ] UNLOAD (EXPR) ...: write a file to storage, which could be external (optional); looks the same as what Redshift does

Please do comment if anything is missing and I will add them to the list.

IMO, since we're dealing with immutable DataFrames, I would suggest that this context stay away from all DML syntax like INSERT, UPDATE, DELETE. Rather, we should go the route of creating new DataFrames that are the result of the DML: INSERT through bulk insert (create a new DF and UNION it with the existing one), UPDATE through creating new columns, and DELETE by creating a new filtered DF.

potter420 avatar Mar 24 '22 02:03 potter420

IMO, since we're dealing with immutable DataFrames, I would suggest that this context stay away from all DML syntax like INSERT, UPDATE, DELETE. Rather, we should go the route of creating new DataFrames that are the result of the DML: INSERT through bulk insert (create a new DF and UNION it with the existing one), UPDATE through creating new columns, and DELETE by creating a new filtered DF.

Yep, sounds good. Note that polars does allow mutations since not too long ago. But idiomatically I don't think we will need them much.

ritchie46 avatar Mar 24 '22 18:03 ritchie46

@potter420 You might find some good inspiration in duckdb, which does much the same thing (we're now using it internally at work to operate more efficiently on intermediary pandas DataFrames from research, but it also groks Arrow, CSV, Parquet, etc.; evaluating it for some out-of-core workloads next).

alexander-beedie avatar Aug 10 '22 09:08 alexander-beedie

Hi,

I am also very interested in querying Polars Dataframes with SQL. I am working with the folks over at PRQL and in particular I'm building a prql-tool to make it easier to use PRQL to query existing datasources. I would love to support Polars as a backend but we currently only emit SQL.

Moreover there is now a proliferation of these DataFrame APIs which are all very similar to each other but also independent. In particular I'm thinking of Polars API, Pandas API, Datafusion DataFrame API, Ibis, dplyr, Spark, to name just a few and there's probably more.

The Substrait project recognises this and tries to provide a single Intermediate Representation (IR) to allow more interoperability. How about implementing a Substrait consumer rather than YASP (Yet Another SQL Parser)? My sense is that the Substrait IR would be much more closely aligned with the polars API and should therefore be simpler to implement. A further benefit would be more interoperability with other query frontends without limiting yourself to a particular SQL dialect.

Datafusion already has a substrait extension package in Rust datafusion-substrait with producer and consumer implementations and their consumer implementation might be a useful starting point for you to assess the viability of implementing a Substrait Consumer for Polars. In order to still be able to consume SQL queries, you might be able to use/lean on the Datafusion Substrait Producer to parse the SQL queries and produce the Substrait IR.

What do you think @potter420 and @ritchie46 ?

P.S. Early adopters of Substrait appear to be Ibis, Datafusion, and DuckDB.

P.P.S. I searched through your Issues list and there seemed to be no other mention of Substrait so far.

snth avatar Sep 21 '22 08:09 snth

This is an excellent overview of the Substrait project: Substrait: Rethinking DBMS Composability (VLDB 2022, CDMS workshop)


snth avatar Sep 22 '22 20:09 snth

I am interested in supporting this. The big impediment is currently my limited bandwidth, so any help would be great.

ritchie46 avatar Sep 23 '22 09:09 ritchie46

Just in case y'all haven't seen it yet, there's a Substrait validator written in Rust ( https://github.com/substrait-io/substrait-validator ) that could at the very least get you started on Substrait protobufs and what they contain. I'd love this feature as a way to make R bindings for polars (I'm involved in writing an R Substrait producer... if polars implemented a consumer, it would make that much, much easier to happen). As an incentive, DuckDB implements SQL->Substrait, so you'd get that for free (well, for an optional DuckDB dependency).

paleolimbot avatar Sep 23 '22 18:09 paleolimbot

Closing this as completed.

universalmind303 avatar Apr 23 '23 18:04 universalmind303

Related to substrait #7404

eitsupi avatar May 17 '23 10:05 eitsupi