Integration with `polars`
Hey there, thanks for building ducklake! Very exciting project!
I was wondering if any thought has already been given to integration with other parquet-consuming engines such as polars. My question is rooted in use cases where I could really use a lightweight data catalog, but a fair amount of processing is implemented in polars, which I am happy with and don't want to change. My ideal situation would be to query ducklake table snapshots in a way that is functionally equivalent to `pl.scan_parquet`. Being able to achieve this would be a big step towards unifying data sourcing for production-like code, which often uses polars, and analytical queries, which are often written in SQL.
I have considered:
- Query using duckdb, and then convert the results to polars via something like `df = duckdb.sql("select * from mytable").pl()`, which I assume supports all ducklake features such as snapshot versioning. I think this works, but is not entirely desirable because it requires an eager load of the table into memory. If the next line is `df.filter(pl.col("somecol") < 1)`, that filter is not propagated to parquet, which it would be with `pl.scan_parquet` (see the sketch below this list).
- Write the filter logic in SQL. Can be done, but feels suboptimal because it causes a tooling break and is not automatic based on existing polars logic.
- Access the ducklake parquet files directly with `pl.scan_parquet`. IIUC, this would only work out of the box if a table was only written once. As soon as multiple inserts / updates / deletes come into play, the multiple parquet files cannot be combined trivially.
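For concreteness, a minimal sketch of the first option (the `ATTACH` arguments and paths are placeholder assumptions about the setup; `mytable` and `somecol` are the example names from above):

```python
import duckdb
import polars as pl

# Placeholder setup: attach a ducklake catalog to a duckdb connection.
con = duckdb.connect()
con.execute("INSTALL ducklake")
con.execute("LOAD ducklake")
con.execute("ATTACH 'ducklake:metadata.ducklake' AS lake (DATA_PATH 'data/')")
con.execute("USE lake")

# Eager: the whole table is materialized in memory before polars ever sees it ...
df = con.sql("SELECT * FROM mytable").pl()
# ... so this filter runs on the in-memory DataFrame and is not pushed down to parquet.
df = df.filter(pl.col("somecol") < 1)

# With pl.scan_parquet, by contrast, the equivalent filter would be pushed into the scan:
# pl.scan_parquet("data/**/*.parquet").filter(pl.col("somecol") < 1).collect()
```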
Are there other good options? Has this type of integration been considered already?
This would be a great feature to have. Based on my tests, `pl.scan_parquet()` on the list of parquet files (Option 2 below) is at least 3x faster than directly querying ducklake with the same filters (Option 1 below):
Option 1: Query Ducklake:

```python
# `con` is a duckdb connection with the ducklake catalog attached.
sql = """
SELECT * FROM ducklake_table
WHERE col1 = 'A0' AND col2 = 'abc'
"""
df = con.execute(sql).pl()
print(df)
```
Option 2: Query Ducklake using polars `scan_parquet`:

```python
# `meta_schema` is the schema holding the ducklake metadata tables,
# `data_path` is the lake's data directory.
sql = f"""
SELECT path FROM {meta_schema}.ducklake_data_file
JOIN {meta_schema}.ducklake_table
ON {meta_schema}.ducklake_data_file.table_id = {meta_schema}.ducklake_table.table_id
WHERE {meta_schema}.ducklake_table.table_name = 'ducklake_table';
"""
df = con.execute(sql).pl()
paths = [data_path.rstrip("/") + "/" + row[0] for row in df.rows()]

df = (
    pl.scan_parquet(paths)
    .filter(
        (pl.col("col1") == "A0")
        & (pl.col("col2") == "abc")
    )
    .collect()
)
print(df)
```
I think a solution could be a Polars IO plugin that issues the DuckLake SQL with the given filters. See https://docs.pola.rs/user-guide/plugins/io_plugins/ (and if someone builds this, please share).
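A rough sketch of what such a plugin could look like, assuming polars' `register_io_source` API from those docs; `scan_ducklake` is a made-up name and the pushdown handling is illustrative only, not a tested implementation:

```python
import duckdb
import polars as pl
from polars.io.plugins import register_io_source

def scan_ducklake(con: duckdb.DuckDBPyConnection, table: str) -> pl.LazyFrame:
    # Assumes `con` has the ducklake catalog attached. Fetch the schema without reading rows.
    schema = con.sql(f"SELECT * FROM {table} LIMIT 0").pl().schema

    def source(with_columns, predicate, n_rows, batch_size):
        query = f"SELECT * FROM {table}"
        if n_rows is not None:
            query += f" LIMIT {n_rows}"
        df = con.sql(query).pl()
        # The polars optimizer hands us the projection and filter; a real implementation
        # would translate these into SQL so ducklake can prune data files, but in this
        # sketch they are simply applied to the fetched frame.
        if with_columns is not None:
            df = df.select(with_columns)
        if predicate is not None:
            df = df.filter(predicate)
        yield df

    return register_io_source(io_source=source, schema=schema)

# Usage: filters written in polars are routed through the plugin.
# lf = scan_ducklake(con, "ducklake_table")
# print(lf.filter(pl.col("col1") == "A0").collect())
```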
Thanks for the good suggestion @xixixao! @xhochy suggested the same thing to me as well. I took a crack at implementing a polars IO plugin for communicating with duckdb and ducklake here. I got to a working solution pretty quickly with this approach.
@pdet kindly ported this functionality into the duckdb python package itself here, so I consider this issue resolved :)