datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Don't scan first column on empty projection

Open Dandandan opened this issue 3 years ago • 12 comments

Is your feature request related to a problem or challenge? Please describe what you are trying to do. Depends on: #2603

When we perform without needing the like SELECT COUNT(1) FROM table, the plan always reads the first column (whatever this is). This is inefficient: in case of formats like Parquet we can avoid scanning / reading the column and just produce the row counts. For non-columnar formats it can avoid unnecessary parsing (or implementing a fast path, i.e. only counting lines).

Projection: Count(1)
  TableScan: test projection=[a]

Should become:

Projection: Count(1)
  TableScan: test projection=[]

Describe the solution you'd like We can push the responsibility of dealing with producing an array with a certain number of rows into the individual readers / other parts of the plans. They should produce RecordBatches with the number of rows. We should remove the line projection.insert(0); from projection push down.

Describe alternatives you've considered

Additional context Some queries in the ClickBench benchmark show this performance issue (https://benchmark.clickhouse.com/ ):

| logical_plan  | Projection: #COUNT(UInt8(1))                                                                                                       |
|               |   Aggregate: groupBy=[[]], aggr=[[COUNT(UInt8(1))]]                                                                                |
|               |     TableScan: hits projection=[WatchID]                                                                                           |

Dandandan avatar Aug 21 '22 09:08 Dandandan

👍 this is an important optimization as select count(*) type queries are so common

alamb avatar Aug 21 '22 10:08 alamb

I find this comment: https://github.com/apache/arrow-datafusion/blob/master/datafusion/optimizer/src/projection_push_down.rs#L98-L100

It says that Ensure that we are reading at least one column from the table. Is there any reason or background of why we need to do this?

HaoYang670 avatar Aug 29 '22 06:08 HaoYang670

I find this comment: https://github.com/apache/arrow-datafusion/blob/master/datafusion/optimizer/src/projection_push_down.rs#L98-L100

It says that Ensure that we are reading at least one column from the table. Is there any reason or background of why we need to do this?

The reason is that several Arrow readers don´t support empty projections. I added a PR for csv / json upstream https://github.com/apache/arrow-rs/pull/2604

Dandandan avatar Aug 29 '22 07:08 Dandandan

The reason is that several Arrow readers don´t support empty projections.

Thank you, @Dandandan. I could reproduce the error when reading csv with empty projection

Arrow error: Invalid argument error: must either specify a row count or at least one column

If this depends on the support of arrow-rs, should we add a new label such as arrow-dependency for this issue?

HaoYang670 avatar Aug 29 '22 07:08 HaoYang670

Might count(*) be as simple as a stats lookup in Parquet or DeltaLake? Reading a billion values just to count them seems sub-optimal, but that can definitely be addressed with a TODO and a future PR.

avantgardnerio avatar Aug 29 '22 15:08 avantgardnerio

Might count(*) be as simple as a stats lookup in Parquet or DeltaLake? Reading a billion values just to count them seems sub-optimal, but that can definitely be addressed with a TODO and a future PR.

You're right, for a schema provider that has statistics available, we can skip scanning. AFAIK DataFusion already has support for using the statistics-provided count/min/max from the provider (e.g. delta lake).

You're right that we could also use the parquet statistics for files instead of skipping reading the columns. I think we don't support this yet. At least for min/max statisticd his avoids having to scan the entire column and compute the min/max.

Dandandan avatar Aug 29 '22 17:08 Dandandan

I think @tustvold has been thinking of this in the context of the various parquet reader improvements

alamb avatar Aug 31 '22 13:08 alamb

I think there are two different optimisations being discussed here:

  • Skip interacting with the file based on catalog statistics if available
  • Remove projection "hack" and delegate to file readers

Parquet has supported the latter since https://github.com/apache/arrow-rs/pull/1560, and CSV/JSON will support it once https://github.com/apache/arrow-rs/pull/2604 is released. I think it should be then be possible to remove the workaround, as it will be no longer necessary.

As to the former, I think it should be fairly straightforward to implement a physical optimiser pass that uses statistics to simplify counts into projections based on statistics if available. I had thought we had already implemented this tbh... :thinking: Edit: Yup AggregateStatistics

tustvold avatar Aug 31 '22 14:08 tustvold

Remove projection "hack" and delegate to file readers

Yes, this is what I was talking about. https://docs.rs/datafusion/latest/datafusion/physical_optimizer/aggregate_statistics/struct.AggregateStatistics.html is very cool 👍 (thanks @rdettai !)

alamb avatar Aug 31 '22 15:08 alamb

Draft PR here: https://github.com/apache/arrow-datafusion/pull/3382 It turns out it is a bit more complex than removing a line, as every Exec node should support producing records without columns/empty schema. I think the only thing we can do is hunting every RecordBatch::try_new and adapting it for projections without columns :thinking:

Dandandan avatar Sep 07 '22 06:09 Dandandan

Maybe we can teach https://docs.rs/arrow/22.0.0/arrow/datatypes/struct.Schema.html#method.project and https://docs.rs/arrow/22.0.0/arrow/record_batch/struct.RecordBatch.html#method.project about empty projections?

alamb avatar Sep 08 '22 20:09 alamb

Maybe we can teach https://docs.rs/arrow/22.0.0/arrow/datatypes/struct.Schema.html#method.project and https://docs.rs/arrow/22.0.0/arrow/record_batch/struct.RecordBatch.html#method.project about empty projections?

Thanks, I did just that yesterday, for RecordBach::project: https://github.com/apache/arrow-rs/pull/2691. Schema::project already seems to handle empty projections just fine 🎉

Dandandan avatar Sep 09 '22 05:09 Dandandan

Closed by #7920

Dandandan avatar Oct 25 '23 07:10 Dandandan