[SPARK-54596][CORE][K8S] Burst-aware Memory Allocation Algorithm for Spark@K8S
What changes were proposed in this pull request?
Intro
This PR represents Pinterest's work to boost Spark cluster efficiency. It proposes Canon, a burst-aware memory allocation algorithm that partitions part of the cluster memory into fixed and burst segments. This approach allows the burst segments to be shared among different pods, improving overall memory utilization.
This PR implements Canon, a burst-aware memory allocation algorithm for memoryOverhead in Spark. The basic idea is that, since memoryOverhead usage is quite bursty, we can split memoryOverhead into two parts: a fixed part (F) and a shared part (S). Using the Kubernetes request/limit concept, the executor pod's memory request equals heap size (H) + F, while its limit is H + F + S.
To calculate F and S, we introduce spark.executor.memoryOverheadBurstyFactor (f) as the control factor. Assuming the user specified spark.executor.memoryOverhead as O, then:
F = O - min{(H + O) * (f - 1), O}
S = O - F = min{(H + O) * (f - 1), O}
Note that F + S = O, so the pod limit stays at H + O while the request is reduced to H + F.
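For illustration, here is a minimal Scala sketch of the split, assuming all sizes are in MiB and f >= 1.0; the names canonSplit and PodMemory are hypothetical and not part of this PR's actual code:

```scala
// Hypothetical sketch of the Canon split (not the actual code in this PR).
// All sizes in MiB: H = executor heap, O = spark.executor.memoryOverhead,
// f = spark.executor.memoryOverheadBurstyFactor (expected f >= 1.0).
case class PodMemory(requestMiB: Long, limitMiB: Long)

def canonSplit(heapMiB: Long, overheadMiB: Long, burstyFactor: Double): PodMemory = {
  // Shared (bursty) part: S = min((H + O) * (f - 1), O)
  val shared = math.min(((heapMiB + overheadMiB) * (burstyFactor - 1.0)).toLong, overheadMiB)
  // Fixed part: F = O - S
  val fixed = overheadMiB - shared
  // Pod memory request covers heap + fixed overhead; the limit additionally allows the shared burst.
  PodMemory(requestMiB = heapMiB + fixed, limitMiB = heapMiB + fixed + shared)
}

// Example: H = 8192, O = 2048, f = 1.1
//   S = min(10240 * 0.1, 2048) = 1024, F = 1024
//   request = 9216 MiB, limit = 10240 MiB (= H + O)
canonSplit(8192L, 2048L, 1.1)  // PodMemory(9216, 10240)
```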
Users can set spark.executor.memoryOverheadBursty.enabled to control whether this functionality is enabled, and spark.executor.memoryOverheadBurstyFactor to control how aggressively part of memoryOverhead is shared among different pods.
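As a usage sketch, the two configs from this PR's description can be set like any other Spark config; the memory values below are illustrative only:

```scala
import org.apache.spark.SparkConf

// Illustrative values only; the two *Bursty* config names come from this PR's description.
val conf = new SparkConf()
  .set("spark.executor.memory", "8g")
  .set("spark.executor.memoryOverhead", "2g")
  .set("spark.executor.memoryOverheadBursty.enabled", "true")  // turn the feature on
  .set("spark.executor.memoryOverheadBurstyFactor", "1.1")     // how aggressively to share overhead
```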
The effectiveness of this algorithm has been validated through production tests at Pinterest.
Acknowledgement
The code in this PR was mainly implemented by Nan Zhu (@CodingCat) while he was working at Pinterest. The algorithm itself is based on https://www.vldb.org/pvldb/vol17/p3759-shi.pdf
SPIP:
https://docs.google.com/document/d/1v5PQel1ygVayBFS8rdtzIH8l1el6H1TDjULD3EyBeIc/edit?tab=t.0
Why are the changes needed?
memoryOverhead usage is bursty, so reserving the full memoryOverhead for every executor pod wastes cluster memory. Splitting it into a fixed part and a shared burst part that multiple pods can draw on improves overall memory utilization and Spark cluster efficiency.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Unit tests, plus production tests at Pinterest.
Was this patch authored or co-authored using generative AI tooling?
No
Thank you @YaoRazor for open sourcing it. We have deployed Canon to thousands of machines at Pinterest, and hopefully it will benefit the broader community as well.
And, most importantly, we really appreciate the innovation from the ByteDance team; this algorithm is implemented based on their paper: https://www.vldb.org/pvldb/vol17/p3759-shi.pdf
@YaoRazor would you mind marking this PR as ready to review?
Hi @sunchao, as we discussed offline, would you mind giving it a review?
Oh this is interesting :)
So this probably requires an SPIP
Yea, I think it'll be useful to have a lightweight SPIP for this feature. In particular we can share experiences of running this in prod at Pinterest, motivations, etc. The SPIP will help to get more attention from the community too, as PRs get ignored easily.
Thank you @holdenk and @sunchao, we will prepare and share a SPIP soon.
Hi @holdenk, @sunchao and @mridulm, we have prepared the SPIP doc at https://docs.google.com/document/d/1v5PQel1ygVayBFS8rdtzIH8l1el6H1TDjULD3EyBeIc/edit?tab=t.0#heading=h.1gf0bimgty0t, thank you again for the early feedback!
Awesome, I’m visiting family this week but I’ll try and take a look.