lance icon indicating copy to clipboard operation
lance copied to clipboard

datafusion-python integration

Open westonpace opened this issue 11 months ago • 5 comments

The datafusion-python project recently added support for "foreign table providers" in https://github.com/apache/datafusion-python/pull/921.

We should be able to utilize this to create a foreign table provider from lance. This would make it very easy to query lance datasets using python and would be comparable to our duckdb integration.

westonpace avatar Jan 03 '25 21:01 westonpace

I have a question, how to expose _rowid and _rowaddr, it seems that datafusion api and duckdb don't support these pseudo columns.

chenkovsky avatar Jan 04 '25 13:01 chenkovsky

For the duckdb integration you can create a dataset with default scan options. You can't filter on the column yet unfortunately because pyarrow and datafusion have interpreted unsigned integers slightly differently in the filtering language (Substrait) and so there is a DF change needed.

def test_duckdb_filter_on_rowid(tmp_path):
    tab = pa.table({"a": [1, 2, 3], "b": [4, 5, 6]})
    ds = lance.write_dataset(tab, str(tmp_path))
    ds = lance.dataset(str(tmp_path), default_scan_options={"with_row_id": True})
    row_ids = ds.scanner(columns=[], with_row_id=True).to_table().column(0).to_pylist()
    expected = tab.slice(1, 1)
    actual = duckdb.query(
        f"SELECT _rowid FROM ds"
    ).fetch_arrow_table()

For datafusion you choose whether you want these columns to appear when you create the table provider:

impl LanceTableProvider {
    fn new(dataset: Arc<Dataset>, with_row_id: bool, with_row_addr: bool) -> Self {
        ...
    }
    ...
}

westonpace avatar Jan 04 '25 14:01 westonpace

You can't filter on the column

cannt filter on rowid or any column ?

I tested the following ut.

def test_duckdb_rowid(tmp_path):
    duckdb = pytest.importorskip("duckdb")
    tbl = create_table_for_duckdb()
    ds = lance.write_dataset(tbl, str(tmp_path))
    ds = lance.dataset(str(tmp_path), default_scan_options={"with_row_id": True})
    duckdb.query("SELECT id, meta, price FROM ds WHERE id==1000").to_df() # error
    duckdb.query("SELECT _rowid, meta, price FROM ds WHERE id==1000").to_df() # error
    duckdb.query("SELECT _rowid, id, meta, price FROM ds").to_df() # error
    duckdb.query("SELECT id, meta, price FROM ds").to_df() # OK

chenkovsky avatar Jan 04 '25 14:01 chenkovsky

impl LanceTableProvider {

Yes, with_row_id, with_row_addr these flags will always work. but I think spark's SupportsMetadataColumns interface is much better.

chenkovsky avatar Jan 04 '25 14:01 chenkovsky

I created a PR for datafusion to illustrate my idea for _rowid support https://github.com/apache/datafusion/pull/14057

chenkovsky avatar Jan 09 '25 14:01 chenkovsky

Closed by https://github.com/lancedb/lance/pull/3649

westonpace avatar May 15 '25 22:05 westonpace