spark [WIP][SPARK-24815] [CORE] Trigger Interval based DRA for Structured Streaming

What changes were proposed in this pull request?

Initial implementation of changes to dynamic resource allocation (DRA) so that it works for Structured Streaming applications.
Design doc: https://docs.google.com/document/d/1_YmfCsQQb9XhRdKh0ijbc-j8JKGtGBxYsk_30NVSTWo/edit?usp=sharing

Why are the changes needed?

To enable auto-scaling of Structured Streaming applications based on trigger-interval heuristics. This improves utilization of cluster resources and reduces cost.
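As a rough illustration of what a trigger-interval heuristic could look like, here is a minimal sketch in plain Python. It is not the code in this PR: the function name `decide_scaling`, the `ScalingDecision` type, and both threshold values are hypothetical and chosen only for illustration. The assumption is that the ratio of batch processing time to trigger interval is used as the scaling signal.

```python
from dataclasses import dataclass


@dataclass
class ScalingDecision:
    action: str         # "scale_up", "scale_down", or "hold"
    utilization: float  # batch duration divided by trigger interval


def decide_scaling(batch_duration_ms: float,
                   trigger_interval_ms: float,
                   scale_up_ratio: float = 0.9,
                   scale_down_ratio: float = 0.5) -> ScalingDecision:
    """Hypothetical heuristic: compare how long a micro-batch took
    against the trigger interval it has to fit into."""
    if trigger_interval_ms <= 0:
        raise ValueError("trigger_interval_ms must be positive")

    utilization = batch_duration_ms / trigger_interval_ms

    # Batches are close to (or over) the trigger interval: request more executors.
    if utilization >= scale_up_ratio:
        return ScalingDecision("scale_up", utilization)
    # Batches finish well within the interval: executors can be released.
    if utilization <= scale_down_ratio:
        return ScalingDecision("scale_down", utilization)
    # Otherwise keep the current executor count.
    return ScalingDecision("hold", utilization)
```

With a 60-second trigger, a batch that takes 55 seconds would ask for more executors under these illustrative thresholds, while a 20-second batch would allow executors to be released.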

Does this PR introduce any user-facing change?

This PR introduces 3 new Spark configurations and changes the behavior of 2 existing configurations depending on the newly introduced ones. More details are in the design doc.
The changes are made against current master.

How was this patch tested?

Works as expected.
Tests TBA.

Aug 04 '23 23:08 pkotikalapudi

The design doc mentions this is an SPIP, but I don't recall a discussion of this on the dev list. Am I missing something, or is this a POC draft for soliciting opinions?

Aug 05 '23 05:08 mridulm

There was no prior discussion; this is a POC. My apologies, I was referring to this link while creating the design doc and thought it would be good enough to be an SPIP. Please let me know if the SPIP reference should be removed, and I will do that.

Aug 05 '23 19:08 pkotikalapudi

The doc mentioned it is an SPIP, but I did not recall a discussion about it, hence my query :-) You can post the proposal to the dev list for discussion. That will ensure better visibility for the proposal as well, while soliciting feedback in parallel.

Aug 06 '23 03:08 mridulm

I sent an email a couple of days ago to [email protected]. Hoping to hear from the community 🤞

Aug 07 '23 05:08 pkotikalapudi

We have a few projects on our team that would greatly benefit from this change. Watching to see when it gets over the line.

Sep 11 '23 18:09 krymitch

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

Dec 21 '23 00:12 github-actions[bot]

I am also very interested in this code making it into Spark. We run a lot of Structured Streaming and have had to resort to statically sized clusters due to the lack of proper scaling support for Structured Streaming.

@PavanKotikalapudi what was the ongoing issue with this? Was there no engagement with the email you sent to the community?

Jan 16 '24 05:01 mentasm

We have also encountered this issue!

Jan 16 '24 05:01 pky-c

@mentasm, yes, we have had an email thread going for a long time. Mich was gracious enough to do a review, but suggested getting engagement/feedback from other members of the community who have expertise in dynamic resource allocation.

@mridulm, @vitgorbunov, @krymitch, @mentasm, @pky-c: any additional feedback/support in the email thread should help bump this up for the Spark dev community to consider.

Jan 16 '24 15:01 pkotikalapudi

Hi @mridulm. We appreciate your support on this. DRA is essential to auto-scaling up and back down. Can you please confirm whether this proposal was ever posted to the dev list for discussion, to ensure better visibility for the proposal and to solicit feedback in parallel? Is that what this thread (https://lists.apache.org/thread/9yx0jnk9h1234joymwlzfx2gh2m8b9bo) was meant to support?

Please let us know what the best steps are to help get this work prioritized. Thank you!

CC: @mridulm, @vitgorbunov , @krymitch , @mentasm , @pky-c, @pkotikalapudi

Jan 16 '24 16:01 krymitch

I will happily add my experiences to that email thread as a consumer of Spark Structured Streaming (SSS) if I can work out how. Our prod environment runs around 40 SSS apps, and traffic is determined by banking transactions, so it varies throughout the day. We would see significant savings if we could scale based on load. As it is, we have to run 24x7 at resource levels that can handle peak load. The batch-centric scaling that currently exists within Spark does not provide any benefit to us, except in non-prod environments where traffic can be zero, so we have actually had to completely disable DRA.

In my opinion, getting something in place for SSS, even if rudimentary in nature, would be a huge step forward for consumers of SSS.

Jan 16 '24 21:01 mentasm

@pkotikalapudi For me, the Google Docs link to the design doc no longer works.

Jan 17 '24 04:01 mentasm

What kind of error do you get? People have commented on it in the past. Are you behind a VPN?

Or, if you are OK sharing your email, I can send an invite to the document.

Jan 17 '24 05:01 pkotikalapudi

Dynamic Resource Allocation for Structured streaming.pdf

I have attached a PDF version of it for viewing, but we still need to get you access to that doc to comment/collaborate.

Jan 17 '24 06:01 pkotikalapudi

It may be my corporate proxy blocking Google Docs. Thanks for the PDF link. It did work for me the first time I tried in 2023.

Jan 17 '24 07:01 mentasm

@pkotikalapudi please share the new voting thread here or in the old thread. A few of us over at Adobe would like to add our votes, since this work will support a few projects we are currently running as well as some future projects. Thanks! And thanks for all the support.

Jan 19 '24 21:01 krymitch

Thanks for the support, @krymitch. Here is the voting email thread.

I will ask my team to do the same as well.

Jan 20 '24 08:01 pkotikalapudi

I don't know how to vote on the email thread. Last time I sent the message from my mail client and it wasn't displayed on the web. I think it's fair to count thumbs-up on this PR as votes as well, if it matters.

Any news on this issue?

Feb 20 '24 19:02 vitgorbunov

@vitgorbunov Agreed. Same here. Looks like there are 7 thumbs-up for this PR. Hopefully thumbs-up on this PR count, since many may not have an official Apache login or may not have the time to send the message via a mail client. Thank you.

Feb 20 '24 19:02 krymitch

We need a PMC member to shepherd/review/merge the effort. I haven't seen any response from Spark PMC members yet.

@vitgorbunov, regarding voting on the email thread: another way you can try is to subscribe to the dev mailing list (subscribe) and then vote.

Feb 20 '24 19:02 pkotikalapudi

Thanks for subscribing and voting, Krystal. Please ask the engineers at Adobe to vote for this in the same manner. I will bump the voting thread again to see if the PMC has any guidance on the development cycle.

Thanks,

Pavan

On Tue, Feb 20, 2024 at 2:40 PM Krystal Mitchell wrote:

Thank you @pkotikalapudi. I subscribed and then messaged dev-thread.35716, and it returned all of our comments, but it is not showing my comment inline in the UI on the same thread. That may be by design. Please let me know if I'm missing something. Thank you. I've got a handful of engineers here at Adobe who would love to vote for this.

Vote on Dynamic resource allocation for structured streaming [SPARK-24815]
35715 by: Mich Talebzadeh
35716 by: Pavan Kotikalapudi
35717 by: Adam Hobbs
35718 by: Pavan Kotikalapudi
35719 by: Mich Talebzadeh
35812 by: Krystal Mitchell

Feb 22 '24 06:02 pkotikalapudi

@rdblue, can you please reopen the PR and remove the Stale tag? I think the GitHub bot will auto-close it again as long as the tag is present.

Thank you

Feb 27 '24 02:02 pkotikalapudi

@jerrypeng, @HeartSaVioR, I have made changes so that DRA can support multiple queries (ref).

I have tested multiple queries in the same driver, as well as batch and streaming queries in the same driver. The apps are scaling according to the DRA configs. Please review.

Thank you

May 13 '24 00:05 pkotikalapudi

hey guys, which version will this land in?

May 26 '24 10:05 stym06

We have to get reviews and approvals from PMC members and our shepherd (@HeartSaVioR) before we can set a timeline for when it can be released.

May 28 '24 18:05 pkotikalapudi

+1 @pkotikalapudi What is the status of this PR and will it make it into Spark 4.0 GA release? Great initiative - I am very surprised Spark does not have an adaptation of DRA to Structured Streaming yet - a very common use case!

Jul 07 '24 13:07 sauletawil

@sauletawil, we still haven't received any approvals/comments from the PMC members or the shepherd. This is the latest communication about this feature: https://lists.apache.org/thread/wpvtvf4w3zygtkfgq4sthbf00y5pqxvr.

It may help move this feature forward if you subscribe to the dev mailing list and enquire there.

Thank you

Jul 08 '24 12:07 pkotikalapudi

We are also waiting for this feature and hope to see it in Spark 4.x

Jul 11 '24 00:07 pky-c

Sad to see this stalled so badly. Looking forward to this functionality

Sep 27 '24 01:09 mentasm
