spark icon indicating copy to clipboard operation
spark copied to clipboard

[SPARK-50903][CONNECT] Cache logical plans after analysis

Open ghost opened this issue 10 months ago • 1 comments

What changes were proposed in this pull request?

Update the plan cache after logical plans are analyzed.

  • The plan cache is only read-accessed during logical plan transformation and never updated.
  • The plan cache is updated once a data frame of the relation is created, which is when a logical plan has been analyzed.
  • The expected number of accesses to the plan cache is the same.

Additionally, this PR includes,

  • Plan cache code refactoring: see the next section for the common coding pattern.
  • Add a plan freshness check method to SessionHolder for future usage.

Why are the changes needed?

The session local plan cache may store unanalyzed logical plans, which causes them to be repeatedly analyzed, resulting in performance degradation.

To be specific, the session local plan cache usage pattern is,

val plan = planner.transformRelation(request.getPlan.getRoot, cachePlan = true)
val dataframe = Dataset.ofRows(session, plan) 

Where an unanalyzed plan may be cached during transformation, and Dataset.ofRows analyzes the unanalyzed plan. This leads to a situation where the same query may hit the cache, returning an unanalyzed plan, and Dataset.ofRows again analyzes the same plan.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

testOnly org.apache.spark.sql.connect.service.SparkConnectSessionHolderSuite

Was this patch authored or co-authored using generative AI tooling?

No.

ghost avatar Jan 21 '25 10:01 ghost

@xi-db FYI @hvanhovell Hi, can you review this PR when you have time?

  • Plans are not cached during transformation.
  • Plans are cached once after data frames are created (after analysis).
  • The number of cache get/put calls stays the same; only the timing was adjusted. Thanks!

ghost avatar Jan 24 '25 12:01 ghost

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions[bot] avatar Sep 05 '25 00:09 github-actions[bot]