lance
lance copied to clipboard
filtering on timestamp column not working when tried with duckdb
Tried filtering on a timestamp column using duckdb against lance, and it did not work. The same query worked against a parquet file.
import pyarrow as pa
import lancedb
import duckdb
from datetime import datetime, timedelta
data = pa.table({
"c1": ["A", "B"],
"c2": ["a", "b"],
"c3": [1, 2],
"c4": [datetime(2021,6,16),datetime(2021,6,17)]
})
db = lancedb.connect("lancedb_files")
db.create_table("lance_table", data)
table = db.open_table("lance_table")
arrow_dataset = table.to_lance()
duckdb.query("select * from arrow_dataset where c4='2021-06-16 00:00:00.000000'").df()
| index | c1 | c2 | c3 | c4 |
|---|
pa.dataset.write_dataset(data, "parquet_files", format='parquet')
duckdb.query("select * from 'parquet_files/*.parquet' where c4='2021-06-16 00:00:00.000000'").df()
| index | c1 | c2 | c3 | c4 |
|---|---|---|---|---|
| 0 | A | a | 1 | 2021-06-16 00:00:00 |
arrow_dataset.schema
c1: string c2: string c3: int64 c4: timestamp[us]
Not sure if this is a bug that is to be reported here, or if it should be filed with duckdb .
I can confirm I can repro when querying lance using duckdb. I have also confirmed that lance timestamp filter is working correctly
In [11]: arrow_dataset.scanner(filter="c4=timestamp '2021-06-16 00:00:00.000000'").to_table().to_pandas()
Out[11]:
c1 c2 c3 c4
0 A a 1 2021-06-16
In [12]: duckdb.query("select * from arrow_dataset where c4=timestamp '2021-06-16 00:00:00.000000'").df()
Out[12]:
Empty DataFrame
Columns: [c1, c2, c3, c4]
Index: []
the issue is most likely somewhere in how we translate filters from duckdb into lance filters.
Yeah seems like an issue with how the substrait translation is happening @westonpace