                        Don't scan first column on empty projection
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Depends on: #2603
When we run a query that doesn't need any column data, like SELECT COUNT(1) FROM table, the plan always reads the first column (whatever that happens to be). This is inefficient: for formats like Parquet we can avoid scanning / reading the column and just produce the row counts. For non-columnar formats it can avoid unnecessary parsing (or allow a fast path, i.e. only counting lines).
Projection: Count(1)
  TableScan: test projection=[a]
Should become:
Projection: Count(1)
  TableScan: test projection=[]
Describe the solution you'd like
We can push the responsibility of producing an array with a certain number of rows into the individual readers / other parts of the plan. They should produce RecordBatches with the correct number of rows but no columns.
We should remove the line projection.insert(0); from projection push down.
Describe alternatives you've considered
Additional context
Some queries in the ClickBench benchmark show this performance issue (https://benchmark.clickhouse.com/):
| logical_plan  | Projection: #COUNT(UInt8(1))                                                                                                       |
|               |   Aggregate: groupBy=[[]], aggr=[[COUNT(UInt8(1))]]                                                                                |
|               |     TableScan: hits projection=[WatchID]                                                                                           |
👍  this is an important optimization as select count(*) type queries are so common
I find this comment: https://github.com/apache/arrow-datafusion/blob/master/datafusion/optimizer/src/projection_push_down.rs#L98-L100
It says that we should "Ensure that we are reading at least one column from the table". Is there any reason or background for why we need to do this?
> I find this comment: https://github.com/apache/arrow-datafusion/blob/master/datafusion/optimizer/src/projection_push_down.rs#L98-L100 It says that we should "Ensure that we are reading at least one column from the table". Is there any reason or background for why we need to do this?
The reason is that several Arrow readers don't support empty projections. I added a PR for csv / json upstream https://github.com/apache/arrow-rs/pull/2604
> The reason is that several Arrow readers don't support empty projections.
Thank you, @Dandandan. I could reproduce the error when reading csv with empty projection
Arrow error: Invalid argument error: must either specify a row count or at least one column
If this depends on the support of arrow-rs, should we add a new label such as arrow-dependency for this issue?
Might count(*) be as simple as a stats lookup in Parquet or DeltaLake? Reading a billion values just to count them seems sub-optimal, but that can definitely be addressed with a TODO and a future PR.
> Might count(*) be as simple as a stats lookup in Parquet or DeltaLake? Reading a billion values just to count them seems sub-optimal, but that can definitely be addressed with a TODO and a future PR.
You're right, for a schema provider that has statistics available, we can skip scanning. AFAIK DataFusion already has support for using the statistics-provided count/min/max from the provider (e.g. delta lake).
You're right that we could also use the per-file Parquet statistics instead of just skipping reading the columns. I think we don't support this yet. At least for min/max statistics this avoids having to scan the entire column and compute the min/max.
I think @tustvold has been thinking of this in the context of the various parquet reader improvements
I think there are two different optimisations being discussed here:
- Skip interacting with the file based on catalog statistics if available
- Remove projection "hack" and delegate to file readers
Parquet has supported the latter since https://github.com/apache/arrow-rs/pull/1560, and CSV/JSON will support it once https://github.com/apache/arrow-rs/pull/2604 is released. I think it should then be possible to remove the workaround, as it will no longer be necessary.
As to the former, I think it should be fairly straightforward to implement a physical optimiser pass that uses statistics to simplify counts into projections based on statistics if available. I had thought we had already implemented this tbh... :thinking: Edit: Yup AggregateStatistics
> Remove projection "hack" and delegate to file readers
Yes, this is what I was talking about. https://docs.rs/datafusion/latest/datafusion/physical_optimizer/aggregate_statistics/struct.AggregateStatistics.html is very cool 👍 (thanks @rdettai !)
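The AggregateStatistics-style rewrite can be illustrated with a self-contained toy (this is not DataFusion's actual code, just the shape of the idea): when a COUNT(*) sits directly on a scan whose catalog statistics report an exact row count, the whole subtree collapses into a literal and the file is never touched.

```rust
// Toy plan: a COUNT(*) aggregate, a table scan with optional exact
// statistics, and a literal result node.
#[derive(Debug, PartialEq)]
enum Plan {
    CountStar(Box<Plan>),                   // SELECT COUNT(*) over the input
    Scan { exact_row_count: Option<u64> },  // scan with optional catalog stats
    Literal(u64),                           // precomputed result
}

// Rewrite COUNT(*)-over-stats into a literal; leave everything else alone.
fn optimize(plan: Plan) -> Plan {
    match plan {
        Plan::CountStar(input) => match *input {
            // Exact statistics available: skip the scan entirely.
            Plan::Scan { exact_row_count: Some(n) } => Plan::Literal(n),
            other => Plan::CountStar(Box::new(optimize(other))),
        },
        other => other,
    }
}

fn main() {
    let plan = Plan::CountStar(Box::new(Plan::Scan { exact_row_count: Some(42) }));
    assert_eq!(optimize(plan), Plan::Literal(42));

    // Without statistics the aggregate is kept and must count rows itself.
    let plan = Plan::CountStar(Box::new(Plan::Scan { exact_row_count: None }));
    assert_eq!(
        optimize(plan),
        Plan::CountStar(Box::new(Plan::Scan { exact_row_count: None }))
    );
}
```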
Draft PR here:
https://github.com/apache/arrow-datafusion/pull/3382
It turns out it is a bit more complex than removing a line, as every Exec node should support producing records without columns / with an empty schema. I think the only thing we can do is hunt down every RecordBatch::try_new and adapt it for projections without columns :thinking:
Maybe we can teach https://docs.rs/arrow/22.0.0/arrow/datatypes/struct.Schema.html#method.project and https://docs.rs/arrow/22.0.0/arrow/record_batch/struct.RecordBatch.html#method.project about empty projections?
> Maybe we can teach https://docs.rs/arrow/22.0.0/arrow/datatypes/struct.Schema.html#method.project and https://docs.rs/arrow/22.0.0/arrow/record_batch/struct.RecordBatch.html#method.project about empty projections?
Thanks, I did just that yesterday, for RecordBatch::project: https://github.com/apache/arrow-rs/pull/2691. Schema::project already seems to handle empty projections just fine 🎉
Closed by #7920