delta Optimize Min/Max using Delta metadata

Description

Follow up of https://github.com/delta-io/delta/issues/1192, which optimizes COUNT. This PR adds support for MIN/MAX as well.

Fix #2092

I plan to do some refactor, but I appreciate any feedback.

How was this patch tested?

Created additional unit tests to cover MIN/MAX.

Does this PR introduce any user-facing changes?

Only performance improvement

Dec 17 '22 02:12 felipepessoto

@felipepessoto just following up on this PR - is it still a WIP?

Jan 03 '23 14:01 scottsand-db

Yes, I made these changes while the SELECT Count was in review, I think I can refine this.

Jan 03 '23 23:01 felipepessoto

@scottsand-db it is ready to review. Thanks

Jan 28 '23 01:01 felipepessoto

Hi folks, did you have a chance to review this? Thanks

Apr 13 '23 22:04 felipepessoto

@scottsand-db

Apr 20 '23 22:04 felipepessoto

@vkorukanti, @scottsand-db, do you think we'll be able to complete this before 2.4 release?

May 01 '23 00:05 felipepessoto

@scottsand-db, @vkorukanti if you have a chance to review this please. Would be great to have this in 2.5.

And once it is completed I'd like to work on other improvements: support to DV, partitioning, group by, etc

May 25 '23 22:05 felipepessoto

@scottsand-db, @vkorukanti, do we still plan to go ahead with these improvements? Let me know to rebase the changes.

Jun 16 '23 17:06 felipepessoto

@felipepessoto - thanks for following up. We are super swamped right now getting a few final features ready for next Delta release ... we will follow up when we can!

Jun 16 '23 17:06 scottsand-db

#1763

Jun 29 '23 19:06 felipepessoto

I'm wondering if this is still on the agenda? I think it would be a wonderful enhancement.

There are many practical use cases where performance improvements on such min/max queries would make a difference. Two examples:

when incrementally loading data to a table, often the first step is to query the max timestamp of that table in order to figure out from where to continue loading more data
BI tools will query the min max values of columns to configure the ranges for their filters or slicers

Aug 31 '23 13:08 henlue

We have some folks asking for more improvements using stats here and in other issues/PRs. I think it would help in a couple of scenarios like @henlue mentioned.

https://github.com/delta-io/delta/issues/1192 https://github.com/delta-io/delta/issues/1916 https://github.com/delta-io/delta/pull/1377

@scottsand-db, @vkorukanti, @dennyglee what would be the best way get community feedback about this? Creating a new issue and asking people to thumbs up would be useful? Is it something maintainers use to prioritize the new features?

Thanks

Sep 22 '23 17:09 felipepessoto

Hey @felipepessoto - good call out, I would probably create new issues for each one so we can garner thumbs up as well as do design reviews (if required), eh?!

Sep 24 '23 21:09 dennyglee

@felipepessoto Sorry for the delay. This is really a great optimization to have. Could you please rebase the changes? I think this will be a great addition to Delta 3.1.

Oct 27 '23 21:10 vkorukanti

@vkorukanti I rebased it and adjusted to Spark 3.5 show()

Oct 28 '23 10:10 felipepessoto

@vkorukanti I rebased it and adjusted to Spark 3.5 show()

Thank you @felipepessoto. Bit swamped this week, but will definitely take a look at it next week. thx!

Nov 01 '23 22:11 vkorukanti

Thanks Venki. I’m out of town, I’m probably working on it next week.

Nov 14 '23 19:11 felipepessoto

I did the first refactor to address some of the comments. I resolved the comments I fixed.

Nov 22 '23 01:11 felipepessoto

I addressed all the comments, except split the "extract from stats" method, which could have some trade-off.

Test is also failing with this error:

/home/runner/work/delta/delta/spark/src/test/scala/org/apache/spark/sql/delta/perf/OptimizeMetadataOnlyDeltaQuerySuite.scala: illegal start of simple expression: Token(RPAREN,),4940,))

It doesn't happen locally. No idea what it is as it doesn't show error line. Let me know if you have any clue.

UPDATE: Found it, it doesn't like comma after the last element in a Seq. In case anybody needs, you can use this command to find the error location:

tail -c +4940 spark/src/test/scala/org/apache/spark/sql/delta/perf/OptimizeMetadataOnlyDeltaQuerySuite.scala

Thanks.

Nov 22 '23 12:11 felipepessoto

Flink tests are flaky? Previous build succeeded

Jan 09 '24 03:01 felipepessoto

delta delta copied to clipboard

Optimize Min/Max using Delta metadata

Description

How was this patch tested?

Does this PR introduce any user-facing changes?

delta
delta copied to clipboard