                        Don't scan first column on empty projection
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Depends on: #2603
When we run a query that doesn't need any column data, like SELECT COUNT(1) FROM table, the plan always reads the first column (whatever that happens to be). This is inefficient: for formats like Parquet we can avoid scanning / reading the column and just produce the row counts. For non-columnar formats it can avoid unnecessary parsing (or allow a fast path, i.e. only counting lines).
Projection: Count(1)
  TableScan: test projection=[a]
Should become:
Projection: Count(1)
  TableScan: test projection=[]
Describe the solution you'd like
We can push the responsibility of producing an array with a certain number of rows into the individual readers / other parts of the plan. They should produce RecordBatches with the correct number of rows but no columns.
We should remove the line projection.insert(0); from projection push down.
Describe alternatives you've considered
Additional context
Some queries in the ClickBench benchmark show this performance issue (https://benchmark.clickhouse.com/):
| logical_plan  | Projection: #COUNT(UInt8(1))                                                                                                       |
|               |   Aggregate: groupBy=[[]], aggr=[[COUNT(UInt8(1))]]                                                                                |
|               |     TableScan: hits projection=[WatchID]                                                                                           |
👍  this is an important optimization as select count(*) type queries are so common
I find this comment: https://github.com/apache/arrow-datafusion/blob/master/datafusion/optimizer/src/projection_push_down.rs#L98-L100
It says that we should "Ensure that we are reading at least one column from the table". Is there any reason or background for why we need to do this?
> I find this comment: https://github.com/apache/arrow-datafusion/blob/master/datafusion/optimizer/src/projection_push_down.rs#L98-L100 It says that we should "Ensure that we are reading at least one column from the table". Is there any reason or background for why we need to do this?
The reason is that several Arrow readers don't support empty projections. I added a PR for csv / json upstream https://github.com/apache/arrow-rs/pull/2604
> The reason is that several Arrow readers don't support empty projections.
Thank you, @Dandandan. I could reproduce the error when reading csv with empty projection
Arrow error: Invalid argument error: must either specify a row count or at least one column
If this depends on the support of arrow-rs, should we add a new label such as arrow-dependency for this issue?
Might count(*) be as simple as a stats lookup in Parquet or DeltaLake? Reading a billion values just to count them seems sub-optimal, but that can definitely be addressed with a TODO and a future PR.
> Might count(*) be as simple as a stats lookup in Parquet or DeltaLake? Reading a billion values just to count them seems sub-optimal, but that can definitely be addressed with a TODO and a future PR.
You're right, for a schema provider that has statistics available, we can skip scanning. AFAIK DataFusion already has support for using the statistics-provided count/min/max from the provider (e.g. delta lake).
You're right that we could also use the per-file Parquet statistics instead of just skipping reading the columns. I think we don't support this yet. At least for min/max statistics this avoids having to scan the entire column and compute the min/max.
I think @tustvold has been thinking of this in the context of the various parquet reader improvements
I think there are two different optimisations being discussed here:
- Skip interacting with the file based on catalog statistics if available
- Remove projection "hack" and delegate to file readers
Parquet has supported the latter since https://github.com/apache/arrow-rs/pull/1560, and CSV/JSON will support it once https://github.com/apache/arrow-rs/pull/2604 is released. I think it should then be possible to remove the workaround, as it will no longer be necessary.
As to the former, I think it should be fairly straightforward to implement a physical optimiser pass that uses statistics to simplify counts into projections based on statistics if available. I had thought we had already implemented this tbh... :thinking: Edit: Yup AggregateStatistics
> Remove projection "hack" and delegate to file readers
Yes, this is what I was talking about. https://docs.rs/datafusion/latest/datafusion/physical_optimizer/aggregate_statistics/struct.AggregateStatistics.html is very cool 👍 (thanks @rdettai !)
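The AggregateStatistics-style rewrite can be illustrated with a self-contained toy (this is not DataFusion's actual code, just the shape of the idea): when a COUNT(*) sits directly on a scan whose catalog statistics report an exact row count, the whole subtree collapses into a literal and the file is never touched.

```rust
// Toy plan: a COUNT(*) aggregate, a table scan with optional exact
// statistics, and a literal result node.
#[derive(Debug, PartialEq)]
enum Plan {
    CountStar(Box<Plan>),                   // SELECT COUNT(*) over the input
    Scan { exact_row_count: Option<u64> },  // scan with optional catalog stats
    Literal(u64),                           // precomputed result
}

// Rewrite COUNT(*)-over-stats into a literal; leave everything else alone.
fn optimize(plan: Plan) -> Plan {
    match plan {
        Plan::CountStar(input) => match *input {
            // Exact statistics available: skip the scan entirely.
            Plan::Scan { exact_row_count: Some(n) } => Plan::Literal(n),
            other => Plan::CountStar(Box::new(optimize(other))),
        },
        other => other,
    }
}

fn main() {
    let plan = Plan::CountStar(Box::new(Plan::Scan { exact_row_count: Some(42) }));
    assert_eq!(optimize(plan), Plan::Literal(42));

    // Without statistics the aggregate is kept and must count rows itself.
    let plan = Plan::CountStar(Box::new(Plan::Scan { exact_row_count: None }));
    assert_eq!(
        optimize(plan),
        Plan::CountStar(Box::new(Plan::Scan { exact_row_count: None }))
    );
}
```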
Draft PR here:
https://github.com/apache/arrow-datafusion/pull/3382
It turns out it is a bit more complex than removing a line, as every Exec node should support producing records without columns / with an empty schema. I think the only thing we can do is hunt down every RecordBatch::try_new and adapt it for projections without columns :thinking:
Maybe we can teach https://docs.rs/arrow/22.0.0/arrow/datatypes/struct.Schema.html#method.project and https://docs.rs/arrow/22.0.0/arrow/record_batch/struct.RecordBatch.html#method.project about empty projections?
> Maybe we can teach https://docs.rs/arrow/22.0.0/arrow/datatypes/struct.Schema.html#method.project and https://docs.rs/arrow/22.0.0/arrow/record_batch/struct.RecordBatch.html#method.project about empty projections?
Thanks, I did just that yesterday, for RecordBatch::project: https://github.com/apache/arrow-rs/pull/2691. Schema::project already seems to handle empty projections just fine 🎉
Closed by #7920