
Integration with `polars`

AndreasAlbertQC opened this issue 6 months ago · 3 comments

Hey there, thanks for building ducklake! Very exciting project!

I was wondering if any thought has already been given to integration with other parquet-consuming engines such as polars. My question is rooted in use cases where I could really use a lightweight data catalog, but a fair amount of processing is implemented in polars, which I am happy with and don't want to change. My ideal situation would be to query ducklake table snapshots in a way that is functionally equivalent to pl.scan_parquet. Achieving this would be a big step towards unifying data sourcing for production-like code, which often uses polars, and for analytical queries, which are often written in SQL.

I have considered:

  1. Query using duckdb, then convert the result to polars via something like df = duckdb.sql("select * from mytable").pl(), which I assume supports all ducklake features such as snapshot versioning. This works, but is not entirely desirable because it eagerly loads the whole table into memory. If the next line is df.filter(pl.col("somecol") < 1), that filter is not pushed down to the parquet scan, as it would be with pl.scan_parquet (see the sketch after this list).
  2. Write the filter logic in SQL. This can be done, but feels suboptimal because it introduces a tooling break and does not reuse the existing polars logic.
  3. Access the ducklake parquet files directly with pl.scan_parquet. IIUC, this only works out of the box if a table was written exactly once. As soon as multiple inserts / updates / deletes come into play, the resulting parquet files can no longer be combined trivially.
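To make the pushdown point from option 1 concrete, here is a minimal sketch (table, column, and file names are made up):

import duckdb
import polars as pl

# Option 1: duckdb materializes the full table before polars sees it.
df = duckdb.sql("select * from mytable").pl()  # eager load into memory
small = df.filter(pl.col("somecol") < 1)       # filter runs after the full load

# With plain parquet files, the same filter is pushed into the scan,
# so polars only reads the row groups that can match.
small = (
    pl.scan_parquet("data/*.parquet")
    .filter(pl.col("somecol") < 1)
    .collect()
)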

Are there other good options? Has this type of integration been considered already?

AndreasAlbertQC · Jun 02 '25 10:06

This would be a great feature to have. Based on my tests, pl.scan_parquet() on the list of parquet files (Option 2 below) is at least 3x faster than directly querying ducklake with the same filters (Option 1):

Option 1: Query Ducklake:

sql = """
  SELECT * from ducklake_table
  WHERE col1 = 'A0' and col2 = 'abc'
"""

df = con.execute(sql).pl()
print(df)

Option 2: Query Ducklake using polars scan_parquet:

sql = f"""
SELECT path FROM {meta_schema}.ducklake_data_file
JOIN {meta_schema}.ducklake_table
  ON {meta_schema}.ducklake_data_file.table_id = {meta_schema}.ducklake_table.table_id
WHERE {meta_schema}.ducklake_table.table_name = 'ducklake_table';
"""
df = con.execute(sql).pl()


paths = [data_path.rstrip("/") + "/" + row[0] for row in df.rows()]


df = (
    pl.scan_parquet(paths)
    .filter(
        (pl.col("col1") == "A0") &
        (pl.col("col2") == "abc")
    )
    .collect()
)

print(df)
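One caveat with this shortcut: the query lists every data file ever written for the table and ignores delete files, so it only matches the table contents if rows were never updated or deleted. A hedged refinement that keeps only files still live in the current snapshot (assuming the begin_snapshot / end_snapshot columns described in the DuckLake metadata spec):

# Restrict to data files that have not been superseded by a later snapshot.
# Delete files are still not applied, so deleted rows would reappear.
sql = f"""
SELECT path FROM {meta_schema}.ducklake_data_file
JOIN {meta_schema}.ducklake_table
  ON {meta_schema}.ducklake_data_file.table_id = {meta_schema}.ducklake_table.table_id
WHERE {meta_schema}.ducklake_table.table_name = 'ducklake_table'
  AND {meta_schema}.ducklake_data_file.end_snapshot IS NULL;
"""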

kumarspark · Jun 03 '25 02:06

I think a solution could be a Polars IO plugin that issues DuckLake SQL with the given filters. See https://docs.pola.rs/user-guide/plugins/io_plugins/ (and if someone builds this, please share)

xixixao · Jun 03 '25 19:06

Thanks for the good suggestion @xixixao! @xhochy suggested the same thing to me as well. I took a crack at implementing a polars IO plugin for communicating with duckdb and ducklake here. I got to a working solution pretty quickly with this approach.
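For readers who want the rough shape of such a plugin, a minimal sketch using polars' register_io_source (names like scan_ducklake and the catalog path are hypothetical; the linked implementation differs in the details):

from collections.abc import Iterator

import duckdb
import polars as pl
from polars.io.plugins import register_io_source


def scan_ducklake(table: str, catalog: str) -> pl.LazyFrame:
    """Lazily scan a DuckLake table through duckdb (sketch, not a real API)."""
    con = duckdb.connect()
    con.execute(f"ATTACH '{catalog}' AS lake")
    # Infer the schema without reading any rows.
    schema = con.execute(f"SELECT * FROM lake.{table} LIMIT 0").pl().schema

    def source(
        with_columns: list[str] | None,
        predicate: pl.Expr | None,
        n_rows: int | None,
        batch_size: int | None,
    ) -> Iterator[pl.DataFrame]:
        # Projection and row-limit pushdown are translated into SQL.
        cols = ", ".join(with_columns) if with_columns else "*"
        sql = f"SELECT {cols} FROM lake.{table}"
        if n_rows is not None:
            sql += f" LIMIT {n_rows}"
        df = con.execute(sql).pl()
        # Pushing the predicate into SQL would need an Expr-to-SQL
        # translation; this sketch applies it after the fetch instead.
        if predicate is not None:
            df = df.filter(predicate)
        yield df

    return register_io_source(source, schema=schema)


# Usage (catalog string is illustrative):
# lf = scan_ducklake("mytable", "ducklake:metadata.ducklake")
# print(lf.filter(pl.col("somecol") < 1).collect())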

AndreasAlbertQC · Jun 05 '25 15:06

@pdet kindly ported this functionality into the duckdb python package itself here, so I consider this issue resolved :)
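For completeness, usage of the ported feature looks roughly like this, if I read the change correctly (hedged: the lazy=True argument to .pl() is my understanding of it; check your duckdb version's documentation):

import duckdb
import polars as pl

con = duckdb.connect()
con.execute("ATTACH 'ducklake:metadata.ducklake' AS lake")  # illustrative catalog path

# Assumed API: recent duckdb versions can return a polars LazyFrame here,
# so downstream polars filters stay lazy instead of running after an eager load.
lf = con.sql("SELECT * FROM lake.mytable").pl(lazy=True)
print(lf.filter(pl.col("somecol") < 1).collect())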

AndreasAlbertQC · Jun 25 '25 16:06