spark
                                
                                
                                
                                    spark copied to clipboard
                            
                            
                            
                        [SPARK-39678][SQL] Improve stats estimation for v2 tables
What changes were proposed in this pull request?
We should propagate the row count stats in SizeInBytesOnlyStatsPlanVisitor if available. Row counts are propagated from connectors to spark in case of v2 tables.
Why are the changes needed?
This can improve stats estimation for v2 tables, since row count is used at places to estimate sizeInBytes.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Modified existing UT's to match the proposed behavior.
cc @huaxingao @cloud-fan @wangyum
Could you enable spark.sql.cbo.enabled to estimate row count?
Thanks @wangyum, I am aware of the alternate visitor we use with cbo.
I raised this pr considering :
- cbo is turned off by default.
 - We already have rowCount propagated via LeafNodes (DSv2Relation) which are used for estimating output size in SizeInBytesOnlyStatsPlanVisitor https://github.com/apache/spark/blob/161c596cafea9c235b5c918d8999c085401d73a9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala#L93-L100
 - ANALYZE is not supported for v2 tables so except row count, IMHO we can't have ndv etc. I am refering to this jira : https://issues.apache.org/jira/browse/SPARK-39420
 - As per my understanding v1 tables can only pass in sizeInBytes unless they have some stats in catalog. whereas v2 tables already give both from the relation itself, hence I thought it's un-accounted for v2 tables. https://github.com/apache/spark/blob/161c596cafea9c235b5c918d8999c085401d73a9/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala#L43-L45
 
Are you recommending it's an expected behavior / by design ?
rebased and regenerated the golden files via :
- SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly PlanStabilitySuite"
 - SPARK_GENERATE_GOLDEN_FILES=1 SPARK_ANSI_SQL_MODE=true build/sbt "sql/testOnly PlanStabilitySuite"
 
I think it's by design. So enabling spark.sql.cbo.enabled is what you want?
Thanks @wangyum !
So enabling spark.sql.cbo.enabled is what you want?
I believe then setting spark.sql.cbo.enabled to true by default could help, (what i wanted was to take this stat of row count, bubbled up from v2 connector to be accounted for in default spark behaviour) but I think it requires some additional efforts, since our other defaults such as auto-bhj etc needs to adjusted accordingly.
I think it's by design
for my knowledge, can you please point me to some jira's ,happy to learn more.
Love to know your thoughts on the same, Happy to close this as well if we consider this is not a problem at all.
Can one of the admins verify this patch?
I'm a bit confused. After this PR, what's the difference between SizeInBytesOnlyStatsPlanVisitor and BasicStatsPlanVisitor?
After this PR, what's the difference between SizeInBytesOnlyStatsPlanVisitor and BasicStatsPlanVisitor
BasicStatsPlanVisitor additionally takes has columnStats such as (NDV / NullCount / min / max etc) on estimation, which generally is not passed from DSv1 / Dsv2 relation itself.
As per my understanding, prior to this PR, SizeInBytesOnlyStatsPlanVisitor was estimating stats on the subset of info i.e only sizeInBytes and BasicStatsPlanVisitor on all 3 info (sizeInBytes, rowcount,ColumStats (min /max /NDV etc), now via this PR SizeInBytesOnlyStatsPlanVisitor is estimating stats on the subset of info but this subset is now (sizeInBytes / rowCount) and BasicStatsPlanVisitor on all 3 info (sizeInBytes, rowcount,ColumStats (min /max /NDV etc).
Maybe we should name them BasicStatesPlanVisitor and AdvancedStatsPlanVisitor. We also need to make sure the updated SizeInBytesOnlyStatsPlanVisitor can propagate row count properly in all cases.
BTW, with CBO off, where do we use row count?
BTW, with CBO off, where do we use row count?
we use it in places like : https://github.com/apache/spark/blob/161c596cafea9c235b5c918d8999c085401d73a9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala#L93-L100
where we just multiply row-count with row size. We also use it for BF to create bloomFilterAgg. In v1 scenario in case of logical relation row-count can seep in from catalog stats but as you correctly pointed out it has a has a chance of row-count getting lost in places where we assume we only have sizeInBytes for example here :
https://github.com/apache/spark/blob/161c596cafea9c235b5c918d8999c085401d73a9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala#L54-L58
OK I think the idea makes sense. With CBO off, the optimizer/planner only needs size in bytes, but row count is also an important statistics to estimate size in bytes, and should be propagated in the stats plan visitor.
cc @wzhfy @c21 can you take a look first?
BTW, with CBO off, where do we use row count?
we use it in places like :
https://github.com/apache/spark/blob/161c596cafea9c235b5c918d8999c085401d73a9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala#L93-L100
where we just multiply row-count with row size. We also use it for BF to create bloomFilterAgg. In v1 scenario in case of logical relation row-count can seep in from catalog stats but as you correctly pointed out it has a has a chance of
row-countgetting lost in places where we assume we only have sizeInBytes for example here :https://github.com/apache/spark/blob/161c596cafea9c235b5c918d8999c085401d73a9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala#L54-L58
thought these stats are available in AQE and more accurate though
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!