Daft
Daft copied to clipboard
parquet metadata only queries perform full scans
Describe the bug
queries such as read_parquet().count_rows()
should not do a full scan, and instead be should able to be fulfilled by the metadata only.
The need for the full scan should be optimized away during physical planning.
To Reproduce
Steps to reproduce the behavior:
daft.read_parquet().count_rows() **Expected behavior** Metadata only operations such as
count(*)or
count_rows()` usually can be fulfilled without needing to perform a full scan.
Additional context
dump from daft.read_parquet('lineitem.parquet').explain(show_all=True)
== Unoptimized Logical Plan ==
* GlobScanOperator
| Glob paths = [../Daft/lineitem.parquet]
| Coerce int96 timestamp unit = Nanoseconds
| IO config = S3 config = { Max connections = 8, Retry initial backoff ms = 1000, Connect timeout ms = 30000, Read timeout ms = 30000, Max retries = 25, Retry mode = adaptive, Anonymous = false, Use SSL = true, Verify SSL = true, Check hostname SSL = true, Requester pays = false, Force Virtual Addressing = false }, Azure config = { Anoynmous = false, Use SSL = true }, GCS config = { Anoynmous = false }
| Use multithreading = true
| File schema = l_orderkey#Int64, l_partkey#Int64, l_suppkey#Int64, l_linenumber#Int64, l_quantity#Int64, l_extendedprice#Float64, l_discount#Float64, l_tax#Float64, l_returnflag#Utf8, l_linestatus#Utf8, l_shipdate#Date, l_commitdate#Date, l_receiptdate#Date, l_shipinstruct#Utf8, l_shipmode#Utf8, comments#Utf8
| Partitioning keys = []
| Output schema = l_orderkey#Int64, l_partkey#Int64, l_suppkey#Int64, l_linenumber#Int64, l_quantity#Int64, l_extendedprice#Float64, l_discount#Float64, l_tax#Float64, l_returnflag#Utf8, l_linestatus#Utf8, l_shipdate#Date, l_commitdate#Date, l_receiptdate#Date, l_shipinstruct#Utf8, l_shipmode#Utf8, comments#Utf8
== Optimized Logical Plan ==
* GlobScanOperator
| Glob paths = [../Daft/lineitem.parquet]
| Coerce int96 timestamp unit = Nanoseconds
| IO config = S3 config = { Max connections = 8, Retry initial backoff ms = 1000, Connect timeout ms = 30000, Read timeout ms = 30000, Max retries = 25, Retry mode = adaptive, Anonymous = false, Use SSL = true, Verify SSL = true, Check hostname SSL = true, Requester pays = false, Force Virtual Addressing = false }, Azure config = { Anoynmous = false, Use SSL = true }, GCS config = { Anoynmous = false }
| Use multithreading = true
| File schema = l_orderkey#Int64, l_partkey#Int64, l_suppkey#Int64, l_linenumber#Int64, l_quantity#Int64, l_extendedprice#Float64, l_discount#Float64, l_tax#Float64, l_returnflag#Utf8, l_linestatus#Utf8, l_shipdate#Date, l_commitdate#Date, l_receiptdate#Date, l_shipinstruct#Utf8, l_shipmode#Utf8, comments#Utf8
| Partitioning keys = []
| Output schema = l_orderkey#Int64, l_partkey#Int64, l_suppkey#Int64, l_linenumber#Int64, l_quantity#Int64, l_extendedprice#Float64, l_discount#Float64, l_tax#Float64, l_returnflag#Utf8, l_linestatus#Utf8, l_shipdate#Date, l_commitdate#Date, l_receiptdate#Date, l_shipinstruct#Utf8, l_shipmode#Utf8, comments#Utf8
== Physical Plan ==
* TabularScan:
| Num Scan Tasks = 21
| Estimated Scan Bytes = 2102749617
| Clustering spec = { Num partitions = 21 }