[SPARK-54596][CORE][K8S] Burst-aware Memory Allocation Algorithm for Spark@K8S
What changes were proposed in this pull request?
Intro
This PR represents Pinterest's work to boost Spark cluster efficiency. It proposes Canon, a burst-aware memory allocation algorithm that partitions part of the cluster memory into fixed and burst segments. This approach allows the burst segments to be shared among different pods, improving overall memory utilization.
This PR implements Canon, a burst-aware memory allocation algorithm for memoryOverhead in Spark. The basic idea is that, since memoryOverhead usage is quite bursty, we can split memoryOverhead into two parts: a fixed part (F) and a shared part (S). Using the Kubernetes request/limit concept, the executor pod's memory request equals heap size (H) + F, while its limit is H + F + S.
To calculate F and S, we introduce spark.executor.memoryOverheadBurstyFactor (f) as the control factor. Assuming the user specified spark.executor.memoryOverhead as O, then:
F = O - min{(H + O) * (f - 1), O}
S = O - F = min{(H + O) * (f - 1), O}
Note that F + S = O, so the pod limit stays at H + O while the request is reduced to H + F.
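For illustration, here is a minimal Scala sketch of the split, assuming all sizes are in MiB and f >= 1.0; the names canonSplit and PodMemory are hypothetical and not part of this PR's actual code:

```scala
// Hypothetical sketch of the Canon split (not the actual code in this PR).
// All sizes in MiB: H = executor heap, O = spark.executor.memoryOverhead,
// f = spark.executor.memoryOverheadBurstyFactor (expected f >= 1.0).
case class PodMemory(requestMiB: Long, limitMiB: Long)

def canonSplit(heapMiB: Long, overheadMiB: Long, burstyFactor: Double): PodMemory = {
  // Shared (bursty) part: S = min((H + O) * (f - 1), O)
  val shared = math.min(((heapMiB + overheadMiB) * (burstyFactor - 1.0)).toLong, overheadMiB)
  // Fixed part: F = O - S
  val fixed = overheadMiB - shared
  // Pod memory request covers heap + fixed overhead; the limit additionally allows the shared burst.
  PodMemory(requestMiB = heapMiB + fixed, limitMiB = heapMiB + fixed + shared)
}

// Example: H = 8192, O = 2048, f = 1.1
//   S = min(10240 * 0.1, 2048) = 1024, F = 1024
//   request = 9216 MiB, limit = 10240 MiB (= H + O)
canonSplit(8192L, 2048L, 1.1)  // PodMemory(9216, 10240)
```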
Users can set spark.executor.memoryOverheadBursty.enabled to control whether this functionality is enabled, and spark.executor.memoryOverheadBurstyFactor to control how aggressively part of memoryOverhead is shared among different pods.
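As a usage sketch, the two configs from this PR's description can be set like any other Spark config; the memory values below are illustrative only:

```scala
import org.apache.spark.SparkConf

// Illustrative values only; the two *Bursty* config names come from this PR's description.
val conf = new SparkConf()
  .set("spark.executor.memory", "8g")
  .set("spark.executor.memoryOverhead", "2g")
  .set("spark.executor.memoryOverheadBursty.enabled", "true")  // turn the feature on
  .set("spark.executor.memoryOverheadBurstyFactor", "1.1")     // how aggressively to share overhead
```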
The effectiveness of this algorithm has been validated through production tests at Pinterest.
Acknowledgement
The code in this PR was mainly implemented by Nan Zhu (@CodingCat) while he was working at Pinterest. The algorithm itself is based on https://www.vldb.org/pvldb/vol17/p3759-shi.pdf
SPIP:
https://docs.google.com/document/d/1v5PQel1ygVayBFS8rdtzIH8l1el6H1TDjULD3EyBeIc/edit?tab=t.0
Why are the changes needed?
memoryOverhead usage is bursty, so reserving the full memoryOverhead for every executor pod wastes cluster memory. Splitting it into a fixed part and a shared burst part that multiple pods can draw on improves overall memory utilization and Spark cluster efficiency.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Unit tests, plus production tests at Pinterest.
Was this patch authored or co-authored using generative AI tooling?
No
Thank you @YaoRazor for open sourcing it. We have deployed Canon to thousands of machines at Pinterest, and hopefully it will benefit the broader community as well.
And, most importantly, we really appreciate the innovation from the ByteDance team; this algorithm is implemented based on their paper: https://www.vldb.org/pvldb/vol17/p3759-shi.pdf
@YaoRazor would you mind marking this PR as ready to review?
Hi @sunchao, as we discussed offline, would you mind giving it a review?
Oh this is interesting :)
So this probably requires an SPIP
Yea, I think it'll be useful to have a lightweight SPIP for this feature. In particular we can share experiences of running this in prod at Pinterest, motivations, etc. The SPIP will help to get more attention from the community too, as PRs get ignored easily.
Thank you @holdenk and @sunchao, we will prepare and share a SPIP soon.
Hi @holdenk, @sunchao and @mridulm, we have prepared the SPIP doc at https://docs.google.com/document/d/1v5PQel1ygVayBFS8rdtzIH8l1el6H1TDjULD3EyBeIc/edit?tab=t.0#heading=h.1gf0bimgty0t, thank you again for the early feedback!
Awesome, I’m visiting family this week but I’ll try and take a look.