delta
delta copied to clipboard
Optimize Min/Max using Delta metadata
Description
Follow up of https://github.com/delta-io/delta/issues/1192, which optimizes COUNT. This PR adds support for MIN/MAX as well.
Fix #2092
I plan to do some refactor, but I appreciate any feedback.
How was this patch tested?
Created additional unit tests to cover MIN/MAX.
Does this PR introduce any user-facing changes?
Only performance improvement
@felipepessoto just following up on this PR - is it still a WIP?
Yes, I made these changes while the SELECT Count was in review, I think I can refine this.
@scottsand-db it is ready to review. Thanks
Hi folks, did you have a chance to review this? Thanks
@scottsand-db
@vkorukanti, @scottsand-db, do you think we'll be able to complete this before 2.4 release?
@scottsand-db, @vkorukanti if you have a chance to review this please. Would be great to have this in 2.5.
And once it is completed I'd like to work on other improvements: support to DV, partitioning, group by, etc
@scottsand-db, @vkorukanti, do we still plan to go ahead with these improvements? Let me know to rebase the changes.
@felipepessoto - thanks for following up. We are super swamped right now getting a few final features ready for next Delta release ... we will follow up when we can!
#1763
I'm wondering if this is still on the agenda? I think it would be a wonderful enhancement.
There are many practical use cases where performance improvements on such min/max queries would make a difference. Two examples:
- when incrementally loading data to a table, often the first step is to query the max timestamp of that table in order to figure out from where to continue loading more data
- BI tools will query the min max values of columns to configure the ranges for their filters or slicers
We have some folks asking for more improvements using stats here and in other issues/PRs. I think it would help in a couple of scenarios like @henlue mentioned.
https://github.com/delta-io/delta/issues/1192 https://github.com/delta-io/delta/issues/1916 https://github.com/delta-io/delta/pull/1377
@scottsand-db, @vkorukanti, @dennyglee what would be the best way get community feedback about this? Creating a new issue and asking people to thumbs up would be useful? Is it something maintainers use to prioritize the new features?
Thanks
Hey @felipepessoto - good call out, I would probably create new issues for each one so we can garner thumbs up as well as do design reviews (if required), eh?!
@felipepessoto Sorry for the delay. This is really a great optimization to have. Could you please rebase the changes? I think this will be a great addition to Delta 3.1.
@vkorukanti I rebased it and adjusted to Spark 3.5 show()
@vkorukanti I rebased it and adjusted to Spark 3.5 show()
Thank you @felipepessoto. Bit swamped this week, but will definitely take a look at it next week. thx!
Thanks Venki. I’m out of town, I’m probably working on it next week.
I did the first refactor to address some of the comments. I resolved the comments I fixed.
I addressed all the comments, except split the "extract from stats" method, which could have some trade-off.
Test is also failing with this error:
/home/runner/work/delta/delta/spark/src/test/scala/org/apache/spark/sql/delta/perf/OptimizeMetadataOnlyDeltaQuerySuite.scala: illegal start of simple expression: Token(RPAREN,),4940,))
It doesn't happen locally. No idea what it is as it doesn't show error line. Let me know if you have any clue.
UPDATE: Found it, it doesn't like comma after the last element in a Seq. In case anybody needs, you can use this command to find the error location:
tail -c +4940 spark/src/test/scala/org/apache/spark/sql/delta/perf/OptimizeMetadataOnlyDeltaQuerySuite.scala
Thanks.
Flink tests are flaky? Previous build succeeded