
[SPARK-39678][SQL] Improve stats estimation for v2 tables

Open singhpk234 opened this issue 3 years ago • 13 comments

What changes were proposed in this pull request?

We should propagate the row count stats in SizeInBytesOnlyStatsPlanVisitor if available. Row counts are propagated from connectors to Spark for v2 tables.

Why are the changes needed?

This can improve stats estimation for v2 tables, since the row count is used in several places to estimate sizeInBytes.
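The core idea — preferring row count times row size over a raw or default sizeInBytes — can be sketched as a toy model (Python used for illustration; the names here are hypothetical, not Spark's actual API):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Stats:
    size_in_bytes: Optional[int]  # may be absent or a crude guess
    row_count: Optional[int]      # reported by v2 connectors when available

def estimate_size(stats: Stats, row_size: int, default_size: int) -> int:
    """Prefer rowCount * rowSize when a row count was reported;
    otherwise fall back to sizeInBytes, then to a configured default."""
    if stats.row_count is not None:
        return stats.row_count * row_size
    if stats.size_in_bytes is not None:
        return stats.size_in_bytes
    return default_size  # analogous to spark.sql.defaultSizeInBytes

# A v2 table reporting 1000 rows, each estimated at 100 bytes:
print(estimate_size(Stats(None, 1000), row_size=100, default_size=2**20))  # 100000
```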

Does this PR introduce any user-facing change?

No

How was this patch tested?

Modified existing UTs to match the proposed behavior.

singhpk234 avatar Jul 05 '22 08:07 singhpk234

cc @huaxingao @cloud-fan @wangyum

singhpk234 avatar Jul 05 '22 08:07 singhpk234

Could you enable spark.sql.cbo.enabled to estimate row count?

Thanks @wangyum, I am aware of the alternate visitor we use with CBO.

I raised this PR considering:

  1. CBO is turned off by default.
  2. We already have rowCount propagated via leaf nodes (DSv2Relation), and it is used for estimating output size in SizeInBytesOnlyStatsPlanVisitor: https://github.com/apache/spark/blob/161c596cafea9c235b5c918d8999c085401d73a9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala#L93-L100
  3. ANALYZE is not supported for v2 tables, so apart from row count, IMHO we can't have NDV etc. I am referring to this JIRA: https://issues.apache.org/jira/browse/SPARK-39420
  4. As per my understanding, v1 tables can only pass in sizeInBytes unless they have some stats in the catalog, whereas v2 tables already provide both from the relation itself; hence I thought row count is unaccounted for with v2 tables. https://github.com/apache/spark/blob/161c596cafea9c235b5c918d8999c085401d73a9/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala#L43-L45

Are you saying this is expected behavior / by design?
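The v1 vs. v2 distinction in point 4 can be modeled as a small sketch (Python for illustration; class and field names are made up, not Spark's real LogicalRelation / DataSourceV2Relation API):

```python
from dataclasses import dataclass
from typing import Optional, Union

@dataclass
class V1Relation:
    # v1: the relation itself only carries sizeInBytes; a row count
    # appears only if catalog stats (e.g. from ANALYZE) exist.
    size_in_bytes: int
    catalog_row_count: Optional[int] = None

@dataclass
class V2Relation:
    # v2: the connector can report both size and row count directly.
    size_in_bytes: int
    row_count: Optional[int] = None

def reported_row_count(rel: Union[V1Relation, V2Relation]) -> Optional[int]:
    if isinstance(rel, V1Relation):
        return rel.catalog_row_count  # often None for v1 tables
    return rel.row_count

print(reported_row_count(V2Relation(size_in_bytes=10_000, row_count=500)))  # 500
print(reported_row_count(V1Relation(size_in_bytes=10_000)))                 # None
```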

singhpk234 avatar Jul 05 '22 09:07 singhpk234

rebased and regenerated the golden files via :

  • SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly PlanStabilitySuite"
  • SPARK_GENERATE_GOLDEN_FILES=1 SPARK_ANSI_SQL_MODE=true build/sbt "sql/testOnly PlanStabilitySuite"

singhpk234 avatar Jul 05 '22 11:07 singhpk234

I think it's by design. So enabling spark.sql.cbo.enabled is what you want?

wangyum avatar Jul 06 '22 00:07 wangyum

Thanks @wangyum !

So enabling spark.sql.cbo.enabled is what you want?

I believe setting spark.sql.cbo.enabled to true by default could help (what I wanted was for the row count stat, bubbled up from the v2 connector, to be accounted for in Spark's default behaviour), but I think that requires some additional effort, since other defaults such as the auto broadcast hash join threshold would need to be adjusted accordingly.

I think it's by design

For my own knowledge, can you please point me to some JIRAs? Happy to learn more.

I'd love to know your thoughts on this; happy to close this PR if we consider it not a problem at all.

singhpk234 avatar Jul 06 '22 05:07 singhpk234

Can one of the admins verify this patch?

AmplabJenkins avatar Jul 06 '22 08:07 AmplabJenkins

I'm a bit confused. After this PR, what's the difference between SizeInBytesOnlyStatsPlanVisitor and BasicStatsPlanVisitor?

cloud-fan avatar Jul 08 '22 06:07 cloud-fan

After this PR, what's the difference between SizeInBytesOnlyStatsPlanVisitor and BasicStatsPlanVisitor

BasicStatsPlanVisitor additionally uses column stats (NDV / null count / min / max, etc.) during estimation, which are generally not passed from the DSv1 / DSv2 relation itself.

As per my understanding, prior to this PR, SizeInBytesOnlyStatsPlanVisitor estimated stats using only a subset of the info, i.e. sizeInBytes, while BasicStatsPlanVisitor used all three kinds (sizeInBytes, rowCount, and column stats such as min / max / NDV). With this PR, SizeInBytesOnlyStatsPlanVisitor still uses a subset, but that subset is now sizeInBytes and rowCount; BasicStatsPlanVisitor continues to use all three.
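The three tiers of statistics being discussed can be laid out in a toy sketch (Python; field names are illustrative, only loosely mirroring Catalyst's Statistics and ColumnStat):

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class ColumnStat:
    distinct_count: Optional[int] = None  # NDV
    null_count: Optional[int] = None
    min_value: Optional[int] = None
    max_value: Optional[int] = None

@dataclass
class Statistics:
    size_in_bytes: int                   # used by both visitors
    row_count: Optional[int] = None      # added to the size-only visitor by this PR
    attribute_stats: Dict[str, ColumnStat] = field(default_factory=dict)  # CBO only

stats = Statistics(size_in_bytes=10_000, row_count=500,
                   attribute_stats={"id": ColumnStat(distinct_count=500)})

# The size-only visitor's view after the PR: size and row count, no column stats.
print((stats.size_in_bytes, stats.row_count))  # (10000, 500)
```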

singhpk234 avatar Jul 08 '22 07:07 singhpk234

Maybe we should name them BasicStatsPlanVisitor and AdvancedStatsPlanVisitor. We also need to make sure the updated SizeInBytesOnlyStatsPlanVisitor can propagate row count properly in all cases.

BTW, with CBO off, where do we use row count?

cloud-fan avatar Jul 08 '22 07:07 cloud-fan

BTW, with CBO off, where do we use row count?

we use it in places like : https://github.com/apache/spark/blob/161c596cafea9c235b5c918d8999c085401d73a9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala#L93-L100

where we just multiply row count by row size. We also use it for the Bloom filter join to create bloomFilterAgg. In the v1 scenario (LogicalRelation), row count can seep in from catalog stats, but as you correctly pointed out, it has a chance of getting lost in places where we assume we only have sizeInBytes, for example here: https://github.com/apache/spark/blob/161c596cafea9c235b5c918d8999c085401d73a9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala#L54-L58
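The row-count-loss scenario in the second link can be sketched like this (Python toy model; `visit_unary_node` is a hypothetical stand-in for the default visit that scales only sizeInBytes):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Stats:
    size_in_bytes: int
    row_count: Optional[int] = None

def visit_unary_node_old(child: Stats, out_row_size: int, in_row_size: int) -> Stats:
    # Pre-PR behavior: scale sizeInBytes by the row-size ratio and
    # drop the child's row count, even when the child had one.
    scaled = max(1, child.size_in_bytes * out_row_size // max(1, in_row_size))
    return Stats(size_in_bytes=scaled)  # row_count is lost here

def visit_unary_node_new(child: Stats, out_row_size: int, in_row_size: int) -> Stats:
    # Post-PR idea: a projection doesn't change cardinality, so the
    # row count can be propagated alongside the scaled size.
    scaled = max(1, child.size_in_bytes * out_row_size // max(1, in_row_size))
    return Stats(size_in_bytes=scaled, row_count=child.row_count)

child = Stats(size_in_bytes=10_000, row_count=500)
print(visit_unary_node_old(child, 4, 8).row_count)  # None: lost
print(visit_unary_node_new(child, 4, 8).row_count)  # 500: propagated
```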

singhpk234 avatar Jul 08 '22 08:07 singhpk234

OK, I think the idea makes sense. With CBO off, the optimizer/planner only needs size in bytes, but row count is an important statistic for estimating size in bytes, and it should be propagated in the stats plan visitor.

cloud-fan avatar Jul 08 '22 08:07 cloud-fan

cc @wzhfy @c21 can you take a look first?

cloud-fan avatar Jul 11 '22 05:07 cloud-fan

BTW, with CBO off, where do we use row count?


Though these stats are also available in AQE, and they are more accurate there.

zinking avatar Jul 25 '22 02:07 zinking

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions[bot] avatar Nov 03 '22 00:11 github-actions[bot]