datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Remove `Min`/`Max` references from `AggregateStatistics`

Open alamb opened this issue 1 year ago • 1 comments

Is your feature request related to a problem or challenge?

Part of https://github.com/apache/datafusion/issues/11151 (where we are removing special case uses of Min/Max)

Describe the solution you'd like

Reemove these cases:

https://github.com/apache/datafusion/blob/569be9eb1ba60d3e842f22de5080c732ac2206f4/datafusion/core/src/physical_optimizer/aggregate_statistics.rs#L185-L186

https://github.com/apache/datafusion/blob/569be9eb1ba60d3e842f22de5080c732ac2206f4/datafusion/core/src/physical_optimizer/aggregate_statistics.rs#L197-L198

https://github.com/apache/datafusion/blob/569be9eb1ba60d3e842f22de5080c732ac2206f4/datafusion/core/src/physical_optimizer/aggregate_statistics.rs#L235-L236

https://github.com/apache/datafusion/blob/569be9eb1ba60d3e842f22de5080c732ac2206f4/datafusion/core/src/physical_optimizer/aggregate_statistics.rs#L247-L248

Specifically, the idea is to remove the pattern

       expr.as_any().downcast_ref::<Max>()

and

       expr.as_any().downcast_ref::<Min>()

Describe alternatives you've considered

I suggest adding a general purpose function to AggregateExec and then implementing it for Min and Max (so we can do the same for Min/Max UDAF)

impl AggregateExpr {

 /// Return the value of the aggregate function, if known, given the number of input rows.
 /// 
 /// Return None if the value can not be determined solely from the input.
 /// 
 /// # Examples
 /// * The `COUNT` aggregate would return `Some(11)` given `num_rows = 11`
 /// * The `MIN` aggregate would return `Some(Null)` given `num_rows = 0
 /// * The `MIN` aggregate would return `None` given num_rows = 11
 fn output_from_rows(&self) -> Option<ScalarValue> { None }
...
}

### Additional context

_No response_

alamb avatar Jun 28 '24 00:06 alamb

take

Rachelint avatar Jun 28 '24 02:06 Rachelint

Thanks @Rachelint -- I also have some time later this week to help with this issue (if it might help to have one or two examples as we work through https://github.com/apache/datafusion/issues/11151)

alamb avatar Jul 01 '24 13:07 alamb

Hi @alamb , I have some questions about the alternatives mentioned in the issue. It seems the interface should be like:

trait AggregateExpr {
	fn output_from_stats(&self, stats: &Statistics) -> Option<ScalarValue> { None }
	...
}

As I understand, it is a optimization takes effect when specific aggregate functions(such as max, min, count...) can get the results directly from stats?

Rachelint avatar Jul 01 '24 15:07 Rachelint

As I understand, it is a optimization takes effect when specific aggregate functions(such as max, min, count...) can get the results directly from stats?

Yes that is the case

I think output_from_stats would also be reasonable and potentially more general. One challenge is that the AggregateExpr might not have access to its argument exprs (and thus it might not know what the argument is) 🤔

alamb avatar Jul 01 '24 15:07 alamb

As I understand, it is a optimization takes effect when specific aggregate functions(such as max, min, count...) can get the results directly from stats?

Yes that is the case

I think output_from_stats would also be reasonable and potentially more general. One challenge is that the AggregateExpr might not have access to its argument exprs (and thus it might not know what the argument is) 🤔

It seems a function expressions exists, and can help? https://github.com/apache/datafusion/blob/569be9eb1ba60d3e842f22de5080c732ac2206f4/datafusion/physical-expr-common/src/aggregate/mod.rs#L121

and it is used to get min/max from stats now? https://github.com/apache/datafusion/blob/569be9eb1ba60d3e842f22de5080c732ac2206f4/datafusion/core/src/physical_optimizer/aggregate_statistics.rs#L250-L257

Rachelint avatar Jul 01 '24 16:07 Rachelint

Sounds like it might work -- I think the only way to really know for sure would be to try it out

alamb avatar Jul 01 '24 16:07 alamb

Sounds like it might work -- I think the only way to really know for sure would be to try it out

Thanks, I try it now.

Rachelint avatar Jul 01 '24 17:07 Rachelint

Sounds like it might work -- I think the only way to really know for sure would be to try it out

Thanks, I try it now.

@Rachelint if you stop working on this could you please let me know?

edmondop avatar Jul 03 '24 23:07 edmondop

Sounds like it might work -- I think the only way to really know for sure would be to try it out

Thanks, I try it now.

@Rachelint if you stop working on this could you please let me know?

Sorry for delay, still working on now, will try to push codes today...

Rachelint avatar Jul 04 '24 04:07 Rachelint

@alamb we can close this too as fixed via #12296 right?

edmondop avatar Sep 30 '24 19:09 edmondop

Thanks @edmondop

alamb avatar Sep 30 '24 20:09 alamb