Daft icon indicating copy to clipboard operation
Daft copied to clipboard

parquet metadata only queries perform full scans

Open universalmind303 opened this issue 9 months ago • 1 comments

Describe the bug queries such as read_parquet().count_rows() should not do a full scan, and instead be should able to be fulfilled by the metadata only.

The need for the full scan should be optimized away during physical planning.

To Reproduce Steps to reproduce the behavior: daft.read_parquet().count_rows() **Expected behavior** Metadata only operations such as count(*)orcount_rows()` usually can be fulfilled without needing to perform a full scan.

Additional context dump from daft.read_parquet('lineitem.parquet').explain(show_all=True)

== Unoptimized Logical Plan ==

* GlobScanOperator
|   Glob paths = [../Daft/lineitem.parquet]
|   Coerce int96 timestamp unit = Nanoseconds
|   IO config = S3 config = { Max connections = 8, Retry initial backoff ms = 1000, Connect timeout ms = 30000, Read timeout ms = 30000, Max retries = 25, Retry mode = adaptive, Anonymous = false, Use SSL = true, Verify SSL = true, Check hostname SSL = true, Requester pays = false, Force Virtual Addressing = false }, Azure config = { Anoynmous = false, Use SSL = true }, GCS config = { Anoynmous = false }
|   Use multithreading = true
|   File schema = l_orderkey#Int64, l_partkey#Int64, l_suppkey#Int64, l_linenumber#Int64, l_quantity#Int64, l_extendedprice#Float64, l_discount#Float64, l_tax#Float64, l_returnflag#Utf8, l_linestatus#Utf8, l_shipdate#Date, l_commitdate#Date, l_receiptdate#Date, l_shipinstruct#Utf8, l_shipmode#Utf8, comments#Utf8
|   Partitioning keys = []
|   Output schema = l_orderkey#Int64, l_partkey#Int64, l_suppkey#Int64, l_linenumber#Int64, l_quantity#Int64, l_extendedprice#Float64, l_discount#Float64, l_tax#Float64, l_returnflag#Utf8, l_linestatus#Utf8, l_shipdate#Date, l_commitdate#Date, l_receiptdate#Date, l_shipinstruct#Utf8, l_shipmode#Utf8, comments#Utf8


== Optimized Logical Plan ==

* GlobScanOperator
|   Glob paths = [../Daft/lineitem.parquet]
|   Coerce int96 timestamp unit = Nanoseconds
|   IO config = S3 config = { Max connections = 8, Retry initial backoff ms = 1000, Connect timeout ms = 30000, Read timeout ms = 30000, Max retries = 25, Retry mode = adaptive, Anonymous = false, Use SSL = true, Verify SSL = true, Check hostname SSL = true, Requester pays = false, Force Virtual Addressing = false }, Azure config = { Anoynmous = false, Use SSL = true }, GCS config = { Anoynmous = false }
|   Use multithreading = true
|   File schema = l_orderkey#Int64, l_partkey#Int64, l_suppkey#Int64, l_linenumber#Int64, l_quantity#Int64, l_extendedprice#Float64, l_discount#Float64, l_tax#Float64, l_returnflag#Utf8, l_linestatus#Utf8, l_shipdate#Date, l_commitdate#Date, l_receiptdate#Date, l_shipinstruct#Utf8, l_shipmode#Utf8, comments#Utf8
|   Partitioning keys = []
|   Output schema = l_orderkey#Int64, l_partkey#Int64, l_suppkey#Int64, l_linenumber#Int64, l_quantity#Int64, l_extendedprice#Float64, l_discount#Float64, l_tax#Float64, l_returnflag#Utf8, l_linestatus#Utf8, l_shipdate#Date, l_commitdate#Date, l_receiptdate#Date, l_shipinstruct#Utf8, l_shipmode#Utf8, comments#Utf8


== Physical Plan ==

* TabularScan:
|   Num Scan Tasks = 21
|   Estimated Scan Bytes = 2102749617
|   Clustering spec = { Num partitions = 21 }

universalmind303 avatar May 23 '24 20:05 universalmind303