`scan_database` feature
Problem description
I have been reading about Ponder recently, and about people's interest in querying a database / data warehouse using a Python-based library instead of SQL.
The reasons for preferring a Python library to SQL are multiple:
- DRY
- ease of doing visualizations
- ease of calling other Python packages for ML and other data work
- etc.

Related to this, Ponder mentions they have 10% of Fortune 100 companies as clients.
Adding a `scan_database` to Polars would allow Polars to do the same, with the advantages of Polars on top of it:
- lazy engine
- cleaner and more consistent syntax
- streaming
- etc
I imagine that this would be a very big undertaking as it would mean converting Polars expressions to multiple SQL dialects. However, even starting with only one dialect would allow Polars to help companies using that particular dialect.
I imagine that focusing on 3-4 SQL dialects would cover a sizable part of the market: BigQuery, PostgreSQL, SQL Server (or Azure SQL), and Snowflake.
We might push some optimizations into a SQL query in the future. I am only thinking about predicates, selections and limits here.
Translating a whole Polars plan to SQL seems like a can of worms that is far out of scope for this project. We might work with Substrait one day, but that is not something we should be focusing on IMO.
I've done some work on using the `AnonymousScan` trait to connect to both Postgres & Mongo, but to ritchie's point, the current execution engine is optimized for file reads, not database queries. While the aforementioned pushdowns work fine for simple queries, for more complex ones such as joins the optimizer isn't currently capable of pushing those down.
I suppose that we could extend the functionality of the optimizer & `AnonymousScan` to allow for custom optimization rules & pushdowns. That'd at least open the door for supporting this kind of functionality.
@ritchie46 if you want, I already have a working implementation of reading from postgres that supports predicate, projection & slice pushdowns. I'd be happy to submit a PR for it under a feature flag, or even a separate crate.
If @ritchie46 agrees, having predicate, projection, and slice pushdown for Postgres would be a great addition.
Hi @ritchie46 and @alexander-beedie, from my understanding, if/once a visitor is exposed on the python side, then the predicates, selections and limits could be pulled out of a lazy query plan? Is that correct? From there, would that open the door to be able to push things down to the database?
If that's accurate, then this issue would be the enabler.
If we get to the stage where you have something ready to push down to a database, what are your thoughts on using sqlglot as a tool to translate SQL into different dialects? With this tool, once you have "polars SQL", you could have them "all" (all the dialects supported by sqlglot; list here).
If this was something you and the polars dev team were open to, then I'd be happy to work on an implementation / PR on their side.
In terms of adoption, they do have a few projects using them (and I think dbt-duckdb will be added there soon too).
Obviously the counter to this is that it adds another optional dependency, but there are already a fair few in read/write database. I've not worked with sqlglot before, so it would potentially need to be evaluated for performance too.
Interested to hear your thoughts. Thanks
> I am only thinking about predicates, selections and limits here.
Projection/predicate push-down into columnar databases like duckdb or clickhouse would greatly reduce disk reading workloads. This is the real strength of columnar data. I hate manually writing raw SQL strings in my python script to select and filter, especially after I am spoiled by Polars expressions. Looking forward to this great feature, thanks!
It would be really handy to have streaming support for reading query results. My use case:
`pl.scan_database(...).stuff().sink_parquet("stuff.parquet")`
Would also love to have `scan_database` with sink support. It would be super useful for exporting from ODBC.
> I imagine that this would be a very big undertaking as it would mean converting Polars expressions to multiple SQL dialects.
Just passing by to add: with SQLAlchemy this is a solved problem.
We "just" need to convert Polars expressions to SQLAlchemy queries. To quote the documentation:
> The SQL Expression Language is a toolkit (...) which provides a system of constructing SQL expressions represented by composable objects, which can then be "executed" against a target database within the scope of a specific transaction, returning a result set.
Several examples of the expressive power of this language (with JOINs, GROUP BY, CTEs) can be found in the SQLAlchemy tutorial.
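For instance, here is a minimal sketch of what such a translation target could look like in SQLAlchemy Core (the `events` table and its columns are invented for illustration; the SQLAlchemy 1.4+ calling style is assumed). The expression compiles to dialect-specific SQL without ever touching a database:

```python
# Sketch: SQLAlchemy Core builds SQL from composable Python objects.
# The `events` table and its columns are invented for illustration.
from sqlalchemy import column, select, table
from sqlalchemy.dialects import postgresql

events = table("events", column("user_id"), column("ts"))

# Roughly what a pushed-down Polars plan could map to:
# projection, predicate and limit as composable objects.
query = (
    select(events.c.user_id, events.c.ts)
    .where(events.c.user_id > 100)
    .limit(10)
)

# Compile to a specific dialect without executing anything.
sql = str(query.compile(dialect=postgresql.dialect()))
print(sql)
```

Swapping `postgresql.dialect()` for another dialect object yields the same logical query in that dialect's SQL.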
If I understand correctly, having support for projection/slice/predicate pushdown in a scan_database() function would allow efficient reading of remote parquet files once combined with something like this: https://github.com/voltrondata/flight-sql-server-example
- The parquet file is loaded as a view in duckdb to be read lazily
- polars sends a SQL query to the server over arrow flight SQL
- The server sends back the arrow data
- polars ingests this arrow data via the arrow flight SQL ADBC driver
Would be great to have a `sink_database` too; one use case is streaming from parquet to a database, like https://github.com/adriangb/pgpq
It would be awesome to have lazy features when reading from DBs. This is especially important in cases where the dataset is larger than memory.
Polars only supports an eager interface for reading from databases, which doesn't work for larger-than-memory dataframes. This feature blocks my downstream work from moving from ingesting CSV files to storing the data in a better place (a database).
There is now an IO plugin mechanism, so database scanners can be written as extensions that don't impact the overall Polars binary. Here's an example: https://github.com/pola-rs/pyo3-polars/tree/main/example/io_plugin
@ritchie46 should this be closed as something that should be added as an IO plugin?