
[SPARK-39678][SQL] Improve stats estimation for v2 tables

Open singhpk234 opened this issue 3 years ago • 13 comments

What changes were proposed in this pull request?

We should propagate the row count stats in SizeInBytesOnlyStatsPlanVisitor if available. Row counts are propagated from connectors to Spark for v2 tables.

Why are the changes needed?

This can improve stats estimation for v2 tables, since the row count is used in several places to estimate sizeInBytes.
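The core idea — preferring row count times row size over a raw or default sizeInBytes — can be sketched as a toy model (Python used for illustration; the names here are hypothetical, not Spark's actual API):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Stats:
    size_in_bytes: Optional[int]  # may be absent or a crude guess
    row_count: Optional[int]      # reported by v2 connectors when available

def estimate_size(stats: Stats, row_size: int, default_size: int) -> int:
    """Prefer rowCount * rowSize when a row count was reported;
    otherwise fall back to sizeInBytes, then to a configured default."""
    if stats.row_count is not None:
        return stats.row_count * row_size
    if stats.size_in_bytes is not None:
        return stats.size_in_bytes
    return default_size  # analogous to spark.sql.defaultSizeInBytes

# A v2 table reporting 1000 rows, each estimated at 100 bytes:
print(estimate_size(Stats(None, 1000), row_size=100, default_size=2**20))  # 100000
```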

Does this PR introduce any user-facing change?

No

How was this patch tested?

Modified existing UTs to match the proposed behavior.

singhpk234 avatar Jul 05 '22 08:07 singhpk234

cc @huaxingao @cloud-fan @wangyum

singhpk234 avatar Jul 05 '22 08:07 singhpk234

Could you enable spark.sql.cbo.enabled to estimate row count?

Thanks @wangyum, I am aware of the alternate visitor we use with CBO.

I raised this PR considering:

  1. CBO is turned off by default.
  2. We already have rowCount propagated via leaf nodes (DSv2Relation), and it is used for estimating output size in SizeInBytesOnlyStatsPlanVisitor: https://github.com/apache/spark/blob/161c596cafea9c235b5c918d8999c085401d73a9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala#L93-L100
  3. ANALYZE is not supported for v2 tables, so apart from row count, IMHO we can't have NDV etc. I am referring to this JIRA: https://issues.apache.org/jira/browse/SPARK-39420
  4. As per my understanding, v1 tables can only pass in sizeInBytes unless they have some stats in the catalog, whereas v2 tables already provide both from the relation itself; hence I thought row count is unaccounted for with v2 tables. https://github.com/apache/spark/blob/161c596cafea9c235b5c918d8999c085401d73a9/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala#L43-L45

Are you saying this is expected behavior / by design?
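The v1 vs. v2 distinction in point 4 can be modeled as a small sketch (Python for illustration; class and field names are made up, not Spark's real LogicalRelation / DataSourceV2Relation API):

```python
from dataclasses import dataclass
from typing import Optional, Union

@dataclass
class V1Relation:
    # v1: the relation itself only carries sizeInBytes; a row count
    # appears only if catalog stats (e.g. from ANALYZE) exist.
    size_in_bytes: int
    catalog_row_count: Optional[int] = None

@dataclass
class V2Relation:
    # v2: the connector can report both size and row count directly.
    size_in_bytes: int
    row_count: Optional[int] = None

def reported_row_count(rel: Union[V1Relation, V2Relation]) -> Optional[int]:
    if isinstance(rel, V1Relation):
        return rel.catalog_row_count  # often None for v1 tables
    return rel.row_count

print(reported_row_count(V2Relation(size_in_bytes=10_000, row_count=500)))  # 500
print(reported_row_count(V1Relation(size_in_bytes=10_000)))                 # None
```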

singhpk234 avatar Jul 05 '22 09:07 singhpk234

rebased and regenerated the golden files via :

  • SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly PlanStabilitySuite"
  • SPARK_GENERATE_GOLDEN_FILES=1 SPARK_ANSI_SQL_MODE=true build/sbt "sql/testOnly PlanStabilitySuite"

singhpk234 avatar Jul 05 '22 11:07 singhpk234

I think it's by design. So enabling spark.sql.cbo.enabled is what you want?

wangyum avatar Jul 06 '22 00:07 wangyum

Thanks @wangyum !

So enabling spark.sql.cbo.enabled is what you want?

I believe setting spark.sql.cbo.enabled to true by default could help (what I wanted was for the row count stat, bubbled up from the v2 connector, to be accounted for in Spark's default behaviour), but I think that requires some additional effort, since other defaults such as the auto broadcast hash join threshold would need to be adjusted accordingly.

I think it's by design

For my own knowledge, can you please point me to some JIRAs? Happy to learn more.

I'd love to know your thoughts on this; happy to close this PR if we consider it not a problem at all.

singhpk234 avatar Jul 06 '22 05:07 singhpk234

Can one of the admins verify this patch?

AmplabJenkins avatar Jul 06 '22 08:07 AmplabJenkins

I'm a bit confused. After this PR, what's the difference between SizeInBytesOnlyStatsPlanVisitor and BasicStatsPlanVisitor?

cloud-fan avatar Jul 08 '22 06:07 cloud-fan

After this PR, what's the difference between SizeInBytesOnlyStatsPlanVisitor and BasicStatsPlanVisitor

BasicStatsPlanVisitor additionally uses column stats (NDV / null count / min / max, etc.) during estimation, which are generally not passed from the DSv1 / DSv2 relation itself.

As per my understanding, prior to this PR, SizeInBytesOnlyStatsPlanVisitor estimated stats using only a subset of the info, i.e. sizeInBytes, while BasicStatsPlanVisitor used all three kinds (sizeInBytes, rowCount, and column stats such as min / max / NDV). With this PR, SizeInBytesOnlyStatsPlanVisitor still uses a subset, but that subset is now sizeInBytes and rowCount; BasicStatsPlanVisitor continues to use all three.
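The three tiers of statistics being discussed can be laid out in a toy sketch (Python; field names are illustrative, only loosely mirroring Catalyst's Statistics and ColumnStat):

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class ColumnStat:
    distinct_count: Optional[int] = None  # NDV
    null_count: Optional[int] = None
    min_value: Optional[int] = None
    max_value: Optional[int] = None

@dataclass
class Statistics:
    size_in_bytes: int                   # used by both visitors
    row_count: Optional[int] = None      # added to the size-only visitor by this PR
    attribute_stats: Dict[str, ColumnStat] = field(default_factory=dict)  # CBO only

stats = Statistics(size_in_bytes=10_000, row_count=500,
                   attribute_stats={"id": ColumnStat(distinct_count=500)})

# The size-only visitor's view after the PR: size and row count, no column stats.
print((stats.size_in_bytes, stats.row_count))  # (10000, 500)
```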

singhpk234 avatar Jul 08 '22 07:07 singhpk234

Maybe we should name them BasicStatsPlanVisitor and AdvancedStatsPlanVisitor. We also need to make sure the updated SizeInBytesOnlyStatsPlanVisitor can propagate row count properly in all cases.

BTW, with CBO off, where do we use row count?

cloud-fan avatar Jul 08 '22 07:07 cloud-fan

BTW, with CBO off, where do we use row count?

we use it in places like : https://github.com/apache/spark/blob/161c596cafea9c235b5c918d8999c085401d73a9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala#L93-L100

where we just multiply row count by row size. We also use it for the Bloom filter join to create bloomFilterAgg. In the v1 scenario (LogicalRelation), row count can seep in from catalog stats, but as you correctly pointed out, it has a chance of getting lost in places where we assume we only have sizeInBytes, for example here: https://github.com/apache/spark/blob/161c596cafea9c235b5c918d8999c085401d73a9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala#L54-L58
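The row-count-loss scenario in the second link can be sketched like this (Python toy model; `visit_unary_node` is a hypothetical stand-in for the default visit that scales only sizeInBytes):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Stats:
    size_in_bytes: int
    row_count: Optional[int] = None

def visit_unary_node_old(child: Stats, out_row_size: int, in_row_size: int) -> Stats:
    # Pre-PR behavior: scale sizeInBytes by the row-size ratio and
    # drop the child's row count, even when the child had one.
    scaled = max(1, child.size_in_bytes * out_row_size // max(1, in_row_size))
    return Stats(size_in_bytes=scaled)  # row_count is lost here

def visit_unary_node_new(child: Stats, out_row_size: int, in_row_size: int) -> Stats:
    # Post-PR idea: a projection doesn't change cardinality, so the
    # row count can be propagated alongside the scaled size.
    scaled = max(1, child.size_in_bytes * out_row_size // max(1, in_row_size))
    return Stats(size_in_bytes=scaled, row_count=child.row_count)

child = Stats(size_in_bytes=10_000, row_count=500)
print(visit_unary_node_old(child, 4, 8).row_count)  # None: lost
print(visit_unary_node_new(child, 4, 8).row_count)  # 500: propagated
```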

singhpk234 avatar Jul 08 '22 08:07 singhpk234

OK, I think the idea makes sense. With CBO off, the optimizer/planner only needs size in bytes, but row count is an important statistic for estimating size in bytes, and it should be propagated in the stats plan visitor.

cloud-fan avatar Jul 08 '22 08:07 cloud-fan

cc @wzhfy @c21 can you take a look first?

cloud-fan avatar Jul 11 '22 05:07 cloud-fan

BTW, with CBO off, where do we use row count?


Though these stats are also available in AQE, and they are more accurate there.

zinking avatar Jul 25 '22 02:07 zinking

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions[bot] avatar Nov 03 '22 00:11 github-actions[bot]