hive icon indicating copy to clipboard operation
hive copied to clipboard

HIVE-27082: AggregateStatsCache.findBestMatch() tests the inclusion of default partition name.

Open ngsg opened this issue 2 years ago • 7 comments

What changes were proposed in this pull request?

AggregateStatsCache.findBestMatch() in Metastore should test the inclusion of default partition name __HIVE_DEFAULT_PARTITION__.

Why are the changes needed?

To avoid the non-deterministic behavior of Hive in generating DAGs.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Manually tested with TPC-DS 10TB

ngsg avatar Apr 17 '23 09:04 ngsg

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

No Coverage information No Coverage information
No Duplication information No Duplication information

sonarqubecloud[bot] avatar Apr 17 '23 12:04 sonarqubecloud[bot]

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Feel free to reach out on the [email protected] list if the patch is in need of reviews.

github-actions[bot] avatar Jun 17 '23 00:06 github-actions[bot]

reopened this @ngsg : is it possible to verify this with a qtest (including explain showing the statistics)?

abstractdog avatar Jun 24 '23 12:06 abstractdog

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 1 Code Smell

No Coverage information No Coverage information
No Duplication information No Duplication information

sonarqubecloud[bot] avatar Jun 24 '23 13:06 sonarqubecloud[bot]

Hi @abstractdog. I looked for our documentation to reproduce the situation we met, and unfortunately, there was insufficient information about this issue. To write a qfile from scratch, I need to study about statistics related syntax that can be used to verify this patch, but I don't have enough time recently. So I'm sorry to say but it will take some time for me to write a qfile for this issue.

ngsg avatar Jun 29 '23 11:06 ngsg

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Feel free to reach out on the [email protected] list if the patch is in need of reviews.

github-actions[bot] avatar Aug 29 '23 00:08 github-actions[bot]

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 1 Code Smell

No Coverage information No Coverage information
No Duplication information No Duplication information

warning The version of Java (11.0.8) you have used to run this analysis is deprecated and we will stop accepting it soon. Please update to at least Java 17. Read more here

sonarqubecloud[bot] avatar Sep 06 '23 09:09 sonarqubecloud[bot]

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Feel free to reach out on the [email protected] list if the patch is in need of reviews.

github-actions[bot] avatar Oct 30 '24 00:10 github-actions[bot]

hi @ngsg, we have tpcds30tb metadata qtests. I think they should be good candidate for the stats checks. If you were able to repro the issue with TPC-DS 10TB, could you try the same testcase with TestTezTPCDS30TBPerfCliDriver?

deniskuzZ avatar Dec 17 '24 21:12 deniskuzZ

@deniskuzZ , Thank you for your advice. I tried to reproduce the issue using TestTezTPCDS30TBPerfCliDriver, but I couldn't observe any query plan changes so far.

I followed these steps to reproduce the issue:

  1. Set hive.metastore.aggregate.stats.cache.enabled to true in data/conf/perf/tpcds30tb/tez/hive-site.xml. (Since AggregateStatsCache is a MetaStore feature, we could not enable it using qfile.)
  2. Run query5 alone and check query plan.
  3. Run queries 64-99, followed by queries 1-5 and then check query plan. According to our records, the plan of query5 differs between step 2 and step 3. However, TestTezTPCDS30TBPerfCliDriver generated the same plan.

At the moment, I’m tied up with other tasks, so I won’t be able to dedicate time to this issue in the near future. Additionally, it seems that I need more time to understand our records as this issue is quite old. However, I’ll prioritize this issue at the top of my Hive TODO list and will share any updates as soon as I discover new information.

ngsg avatar Dec 20 '24 10:12 ngsg

Thank you to @abstractdog and @deniskuzZ for reviewing the patch. After studying the issue again, I have concluded that the proposed patch is insufficient to fully address the issue, so I have decided to close this PR.

We reviewed the issue and determined that this patch could be helpful for datasets where the null-distribution is similar to TPC-DS data. Specifically, if numNulls is closely related to the default partition(null partition), this improves the accuracy of the numNulls returned by AggregateStatsCache. However, not all datasets have a null distribution similar to TPC-DS, meaning this patch isn't universally applicable.

Additionally, it appears that the non-deterministic behavior we encountered is unrelated to numNulls. We observed that AggregateStatsCache makes SemiJoin branch removal non-deterministic, but it uses numRows and NDV, not numNulls. Therefore, I think this non-deterministic behavior is not suitable for verifying the patch.

While some optimizers may depend on numNulls, I have not yet been able to write a qfile to properly verify this patch. As a result, I believe it is best to close the PR for now and reopen it once I have either a more effective patch or a feasible qfile to verify this patch.

ngsg avatar Jan 08 '25 12:01 ngsg