datafusion-python integration
The datafusion-python project recently added support for "foreign table providers" in https://github.com/apache/datafusion-python/pull/921.
We should be able to use this to create a foreign table provider from lance. This would make it easy to query lance datasets from Python and would be comparable to our duckdb integration.
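A rough sketch of what this could look like from the Python side, assuming lance exports a small wrapper around its Rust TableProvider (the FFILanceTableProvider name below is an assumption, not a settled API) and using the register_table_provider hook added by that datafusion-python PR:

# Hypothetical sketch: query a lance dataset through datafusion-python's
# foreign table provider support. FFILanceTableProvider is an assumed name
# for whatever wrapper lance ends up exposing over the Rust TableProvider.
import lance
import pyarrow as pa
from datafusion import SessionContext

tab = pa.table({"a": [1, 2, 3], "b": [4, 5, 6]})
ds = lance.write_dataset(tab, "demo.lance")

provider = FFILanceTableProvider(ds)  # assumed wrapper, see comment above
ctx = SessionContext()
ctx.register_table_provider("demo", provider)
ctx.sql("SELECT a, b FROM demo WHERE a > 1").show()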
I have a question: how do we expose _rowid and _rowaddr? It seems that the datafusion API and duckdb don't support these pseudo-columns.
For the duckdb integration you can open the dataset with default scan options. Unfortunately you can't filter on the column yet, because pyarrow and datafusion interpret unsigned integers slightly differently in the filtering language (Substrait), so a DataFusion change is needed.
import duckdb
import lance
import pyarrow as pa


def test_duckdb_filter_on_rowid(tmp_path):
    tab = pa.table({"a": [1, 2, 3], "b": [4, 5, 6]})
    lance.write_dataset(tab, str(tmp_path))
    # Re-open with row ids included in the default scan so duckdb can see _rowid
    ds = lance.dataset(str(tmp_path), default_scan_options={"with_row_id": True})
    row_ids = ds.scanner(columns=[], with_row_id=True).to_table().column(0).to_pylist()
    expected = tab.slice(1, 1)
    # Selecting _rowid works; filtering on it (e.g. WHERE _rowid = row_ids[1] to
    # recover `expected`) does not yet, because of the Substrait issue above.
    actual = duckdb.query("SELECT _rowid FROM ds").fetch_arrow_table()
    assert actual.num_rows == tab.num_rows
For datafusion you choose whether you want these columns to appear when you create the table provider:
impl LanceTableProvider {
    fn new(dataset: Arc<Dataset>, with_row_id: bool, with_row_addr: bool) -> Self {
        ...
    }
    ...
}
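To make the effect of those flags concrete, here is a hedged sketch of how they might surface from Python once the provider is wired through the FFI; the Python-level LanceTableProvider constructor is an assumption mirroring the Rust signature above:

# Hypothetical: enable the flags when building the provider so that
# _rowid / _rowaddr show up as ordinary columns in SQL. The Python binding
# and its constructor are assumptions mirroring LanceTableProvider::new.
import lance
from datafusion import SessionContext

ds = lance.dataset("demo.lance")
provider = LanceTableProvider(ds, with_row_id=True, with_row_addr=True)  # assumed binding
ctx = SessionContext()
ctx.register_table_provider("demo", provider)
ctx.sql("SELECT _rowid, _rowaddr, a FROM demo").show()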
Regarding "You can't filter on the column": do you mean we can't filter on _rowid specifically, or on any column? I ran the following unit test:
import lance
import pytest


def test_duckdb_rowid(tmp_path):
    duckdb = pytest.importorskip("duckdb")
    tbl = create_table_for_duckdb()  # test helper defined elsewhere
    lance.write_dataset(tbl, str(tmp_path))
    ds = lance.dataset(str(tmp_path), default_scan_options={"with_row_id": True})
    duckdb.query("SELECT id, meta, price FROM ds WHERE id==1000").to_df()  # error
    duckdb.query("SELECT _rowid, meta, price FROM ds WHERE id==1000").to_df()  # error
    duckdb.query("SELECT _rowid, id, meta, price FROM ds").to_df()  # error
    duckdb.query("SELECT id, meta, price FROM ds").to_df()  # OK
Regarding the LanceTableProvider flags: yes, with_row_id and with_row_addr will always work, but I think Spark's SupportsMetadataColumns interface is a much better approach.
I created a PR for datafusion to illustrate my idea for _rowid support https://github.com/apache/datafusion/pull/14057
Closed by https://github.com/lancedb/lance/pull/3649