delta icon indicating copy to clipboard operation
delta copied to clipboard

Optimize Min/Max using Delta metadata

Open felipepessoto opened this issue 2 years ago • 19 comments

Description

Follow up of https://github.com/delta-io/delta/issues/1192, which optimizes COUNT. This PR adds support for MIN/MAX as well.

Fix #2092

I plan to do some refactor, but I appreciate any feedback.

How was this patch tested?

Created additional unit tests to cover MIN/MAX.

Does this PR introduce any user-facing changes?

Only performance improvement

felipepessoto avatar Dec 17 '22 02:12 felipepessoto

@felipepessoto just following up on this PR - is it still a WIP?

scottsand-db avatar Jan 03 '23 14:01 scottsand-db

Yes, I made these changes while the SELECT Count was in review, I think I can refine this.

felipepessoto avatar Jan 03 '23 23:01 felipepessoto

@scottsand-db it is ready to review. Thanks

felipepessoto avatar Jan 28 '23 01:01 felipepessoto

Hi folks, did you have a chance to review this? Thanks

felipepessoto avatar Apr 13 '23 22:04 felipepessoto

@scottsand-db

felipepessoto avatar Apr 20 '23 22:04 felipepessoto

@vkorukanti, @scottsand-db, do you think we'll be able to complete this before 2.4 release?

felipepessoto avatar May 01 '23 00:05 felipepessoto

@scottsand-db, @vkorukanti if you have a chance to review this please. Would be great to have this in 2.5.

And once it is completed I'd like to work on other improvements: support to DV, partitioning, group by, etc

felipepessoto avatar May 25 '23 22:05 felipepessoto

@scottsand-db, @vkorukanti, do we still plan to go ahead with these improvements? Let me know to rebase the changes.

felipepessoto avatar Jun 16 '23 17:06 felipepessoto

@felipepessoto - thanks for following up. We are super swamped right now getting a few final features ready for next Delta release ... we will follow up when we can!

scottsand-db avatar Jun 16 '23 17:06 scottsand-db

#1763

felipepessoto avatar Jun 29 '23 19:06 felipepessoto

I'm wondering if this is still on the agenda? I think it would be a wonderful enhancement.

There are many practical use cases where performance improvements on such min/max queries would make a difference. Two examples:

  • when incrementally loading data to a table, often the first step is to query the max timestamp of that table in order to figure out from where to continue loading more data
  • BI tools will query the min max values of columns to configure the ranges for their filters or slicers

henlue avatar Aug 31 '23 13:08 henlue

We have some folks asking for more improvements using stats here and in other issues/PRs. I think it would help in a couple of scenarios like @henlue mentioned.

https://github.com/delta-io/delta/issues/1192 https://github.com/delta-io/delta/issues/1916 https://github.com/delta-io/delta/pull/1377

@scottsand-db, @vkorukanti, @dennyglee what would be the best way get community feedback about this? Creating a new issue and asking people to thumbs up would be useful? Is it something maintainers use to prioritize the new features?

Thanks

felipepessoto avatar Sep 22 '23 17:09 felipepessoto

Hey @felipepessoto - good call out, I would probably create new issues for each one so we can garner thumbs up as well as do design reviews (if required), eh?!

dennyglee avatar Sep 24 '23 21:09 dennyglee

@felipepessoto Sorry for the delay. This is really a great optimization to have. Could you please rebase the changes? I think this will be a great addition to Delta 3.1.

vkorukanti avatar Oct 27 '23 21:10 vkorukanti

@vkorukanti I rebased it and adjusted to Spark 3.5 show()

felipepessoto avatar Oct 28 '23 10:10 felipepessoto

@vkorukanti I rebased it and adjusted to Spark 3.5 show()

Thank you @felipepessoto. Bit swamped this week, but will definitely take a look at it next week. thx!

vkorukanti avatar Nov 01 '23 22:11 vkorukanti

Thanks Venki. I’m out of town, I’m probably working on it next week.

felipepessoto avatar Nov 14 '23 19:11 felipepessoto

I did the first refactor to address some of the comments. I resolved the comments I fixed.

felipepessoto avatar Nov 22 '23 01:11 felipepessoto

I addressed all the comments, except split the "extract from stats" method, which could have some trade-off.

Test is also failing with this error:

/home/runner/work/delta/delta/spark/src/test/scala/org/apache/spark/sql/delta/perf/OptimizeMetadataOnlyDeltaQuerySuite.scala: illegal start of simple expression: Token(RPAREN,),4940,))

It doesn't happen locally. No idea what it is as it doesn't show error line. Let me know if you have any clue.

UPDATE: Found it, it doesn't like comma after the last element in a Seq. In case anybody needs, you can use this command to find the error location:

tail -c +4940 spark/src/test/scala/org/apache/spark/sql/delta/perf/OptimizeMetadataOnlyDeltaQuerySuite.scala

Thanks.

felipepessoto avatar Nov 22 '23 12:11 felipepessoto

Flink tests are flaky? Previous build succeeded

felipepessoto avatar Jan 09 '24 03:01 felipepessoto