Integration with `polars`
Hey there, thanks for building ducklake! Very exciting project!
I was wondering if any thought has already been given to integration with other parquet-consuming engines such as polars. My question is rooted in use cases where I could really use a lightweight data catalog, but a fair amount of processing is implemented in polars, which I am happy with and don't want to change. My ideal situation would be to query ducklake table snapshots in a way that is functionally equivalent to `pl.scan_parquet`. Being able to achieve this would be a big step towards unifying data sourcing for production-like code, which often uses polars, and analytical queries, which are often written in SQL.
I have considered:
- Query using duckdb, and then convert the results to polars via something like `df = duckdb.sql("select * from mytable").pl()`, which I assume supports all ducklake features such as snapshot versioning. I think this works, but is not entirely desirable because it requires an eager load of the table into memory. If the next line is `df.filter(pl.col("somecol") < 1)`, that filter is not propagated to parquet, which it would be with `pl.scan_parquet` (see the sketch below this list).
- Write the filter logic in SQL. Can be done, but feels suboptimal because it causes a tooling break and is not automatic based on existing polars logic.
- Access the ducklake parquet files directly with `pl.scan_parquet`. IIUC, this would only work out of the box if a table was only written once. As soon as multiple inserts / updates / deletes come into play, the multiple parquet files cannot be combined trivially.
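For concreteness, a minimal sketch of the first option (the `ATTACH` arguments and paths are placeholder assumptions about the setup; `mytable` and `somecol` are the example names from above):

```python
import duckdb
import polars as pl

# Placeholder setup: attach a ducklake catalog to a duckdb connection.
con = duckdb.connect()
con.execute("INSTALL ducklake")
con.execute("LOAD ducklake")
con.execute("ATTACH 'ducklake:metadata.ducklake' AS lake (DATA_PATH 'data/')")
con.execute("USE lake")

# Eager: the whole table is materialized in memory before polars ever sees it ...
df = con.sql("SELECT * FROM mytable").pl()
# ... so this filter runs on the in-memory DataFrame and is not pushed down to parquet.
df = df.filter(pl.col("somecol") < 1)

# With pl.scan_parquet, by contrast, the equivalent filter would be pushed into the scan:
# pl.scan_parquet("data/**/*.parquet").filter(pl.col("somecol") < 1).collect()
```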
Are there other good options? Has this type of integration been considered already?
This would be a great feature to have. Based on my tests, `pl.scan_parquet()` on the list of parquet files (Option 2 below) is at least 3x faster than directly querying ducklake with the same filters (Option 1 below):
Option 1: Query Ducklake:

```python
# `con` is a duckdb connection with the ducklake catalog attached.
sql = """
SELECT * FROM ducklake_table
WHERE col1 = 'A0' AND col2 = 'abc'
"""
df = con.execute(sql).pl()
print(df)
```
Option 2: Query Ducklake using polars `scan_parquet`:

```python
# `meta_schema` is the schema holding the ducklake metadata tables,
# `data_path` is the lake's data directory.
sql = f"""
SELECT path FROM {meta_schema}.ducklake_data_file
JOIN {meta_schema}.ducklake_table
ON {meta_schema}.ducklake_data_file.table_id = {meta_schema}.ducklake_table.table_id
WHERE {meta_schema}.ducklake_table.table_name = 'ducklake_table';
"""
df = con.execute(sql).pl()
paths = [data_path.rstrip("/") + "/" + row[0] for row in df.rows()]

df = (
    pl.scan_parquet(paths)
    .filter(
        (pl.col("col1") == "A0")
        & (pl.col("col2") == "abc")
    )
    .collect()
)
print(df)
```
I think a solution could be a Polars IO plugin that issues the DuckLake SQL with the given filters. See https://docs.pola.rs/user-guide/plugins/io_plugins/ (and if someone builds this, please share).
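A rough sketch of what such a plugin could look like, assuming polars' `register_io_source` API from those docs; `scan_ducklake` is a made-up name and the pushdown handling is illustrative only, not a tested implementation:

```python
import duckdb
import polars as pl
from polars.io.plugins import register_io_source

def scan_ducklake(con: duckdb.DuckDBPyConnection, table: str) -> pl.LazyFrame:
    # Assumes `con` has the ducklake catalog attached. Fetch the schema without reading rows.
    schema = con.sql(f"SELECT * FROM {table} LIMIT 0").pl().schema

    def source(with_columns, predicate, n_rows, batch_size):
        query = f"SELECT * FROM {table}"
        if n_rows is not None:
            query += f" LIMIT {n_rows}"
        df = con.sql(query).pl()
        # The polars optimizer hands us the projection and filter; a real implementation
        # would translate these into SQL so ducklake can prune data files, but in this
        # sketch they are simply applied to the fetched frame.
        if with_columns is not None:
            df = df.select(with_columns)
        if predicate is not None:
            df = df.filter(predicate)
        yield df

    return register_io_source(io_source=source, schema=schema)

# Usage: filters written in polars are routed through the plugin.
# lf = scan_ducklake(con, "ducklake_table")
# print(lf.filter(pl.col("col1") == "A0").collect())
```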
Thanks for the good suggestion @xixixao! @xhochy suggested the same thing to me as well. I took a crack at implementing a polars IO plugin for communicating with duckdb and ducklake here. I got to a working solution pretty quickly with this approach.
@pdet kindly ported this functionality into the duckdb python package itself here, so I consider this issue resolved :)